Measuring AI Review Value with Copilot Code Review Metrics: KPI Design by security and bug_risk
Audience: Engineering teams measuring AI code review
Engineering managers, SREs, and platform teams that use GitHub Copilot code review and need a practical way to measure whether AI review is working.
Key Points
- On May 8, 2026, GitHub added `comment_type`-level Copilot code review suggestion metrics.
- Labels such as `security` and `bug_risk` are Copilot-assigned categories, not severity ratings. Treat them as review-signal buckets.
- Repository-level breakdowns are not available yet, so the first step is an Enterprise or Organization baseline over 4 to 12 weeks.
What Changed
On May 8, 2026, GitHub added `copilot_suggestions_by_comment_type` to the Copilot usage metrics API [1]. This makes it possible to aggregate Copilot code review suggestions by `comment_type`.
Each entry has three values that matter for KPI design.
- `comment_type`: the type label assigned by Copilot
- `total_copilot_suggestions`: the number of Copilot suggestions posted for that type
- `total_copilot_applied_suggestions`: the number of Copilot suggestions that developers applied
This is not a new one-call dashboard that explains all review value. More precisely, Copilot usage metrics reports now include a new breakdown field for Copilot review suggestions.
The most important caveat is that security and bug_risk are not severity ratings. They are Copilot-assigned categories. In KPI work, read them first as "what kind of review signal Copilot produced" and "how often developers applied that signal."
Why This KPI Helps
AI review metrics often fail because aggregate counts are too vague. More suggestions can mean better coverage, but they can also mean more false positives.
Once the data is grouped by comment_type, better questions become possible.
- Are `security` suggestions increasing while applied rate is falling?
- Is `bug_risk` highly applied but too rare to matter?
- Did one category spike after a policy or configuration change?
- Did the mix of accepted suggestions change after a team changed its review rules?
The central metric is applied rate by type.
applied rate by type = total_copilot_applied_suggestions / total_copilot_suggestions
This is a practical proxy for whether developers considered that suggestion type worth applying. It is not a perfect value metric. A developer may apply a trivial suggestion just to resolve a comment, and a valid suggestion may be rejected for architectural reasons. Pair it with review duration, follow-up comments, and rework signals.
KPI Design by comment_type
Do not start with a complex composite score. A useful baseline needs four metrics by Enterprise or Organization scope and by comment_type.
| Metric | Formula | Use |
|---|---|---|
| Suggestion density | Suggestions by type / reviewed PRs | Shows where Copilot reacts |
| Applied rate by type | Applied suggestions / suggestions | Shows how often developers adopt the signal |
| Suggestion mix | Suggestions by type / all suggestions | Shows the distribution of review output |
| Applied mix | Applied suggestions by type / all applied suggestions | Shows where accepted value concentrates |
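The four formulas in the table can be sketched as a small helper. This is a minimal illustration, not part of the GitHub API: `TypeTotals`, `safe_div`, and `baseline_metrics` are hypothetical names, the input numbers are made up, and `reviewed_prs` is assumed to come from whatever reviewed-PR count you already track, since the article does not say which report field supplies it.

```python
from dataclasses import dataclass


@dataclass
class TypeTotals:
    posted: int   # total_copilot_suggestions for one comment_type
    applied: int  # total_copilot_applied_suggestions for one comment_type


def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def baseline_metrics(totals: dict[str, TypeTotals], reviewed_prs: int) -> dict[str, dict[str, float]]:
    """Compute the four baseline metrics for each comment_type."""
    all_posted = sum(t.posted for t in totals.values())
    all_applied = sum(t.applied for t in totals.values())
    return {
        comment_type: {
            "suggestion_density": safe_div(t.posted, reviewed_prs),
            "applied_rate": safe_div(t.applied, t.posted),
            "suggestion_mix": safe_div(t.posted, all_posted),
            "applied_mix": safe_div(t.applied, all_applied),
        }
        for comment_type, t in totals.items()
    }


# Illustrative numbers only, not real report data.
example = {
    "security": TypeTotals(posted=40, applied=10),
    "bug_risk": TypeTotals(posted=60, applied=30),
}
metrics = baseline_metrics(example, reviewed_prs=200)
for comment_type, values in metrics.items():
    print(comment_type, {name: round(v, 3) for name, v in values.items()})
```

Guarding every division keeps a type with zero posted suggestions from raising an error during a quiet reporting period.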
The strongest view is a two-axis map: suggestion density and applied rate.
| State | Interpretation | Next action |
|---|---|---|
| High density, high applied rate | Core category | Put it in the standard KPI set |
| High density, low applied rate | Possible false-positive cluster | Review paths, rules, and review triggers |
| Low density, high applied rate | Rare but useful signal | Expand coverage or target repositories |
| Low density, low applied rate | Low current value | Observe before optimizing |
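One way to place each type on this map is to split both axes at the median of your own data rather than at a fixed threshold. The sketch below assumes that choice; `classify` is a hypothetical helper, and the `other_a` and `other_b` labels and all numbers are placeholders, not confirmed `comment_type` values.

```python
from statistics import median


def classify(density: float, applied_rate: float,
             density_cutoff: float, rate_cutoff: float) -> str:
    """Map one comment_type onto the four states of the two-axis table."""
    high_density = density >= density_cutoff
    high_rate = applied_rate >= rate_cutoff
    if high_density and high_rate:
        return "core category"
    if high_density:
        return "possible false-positive cluster"
    if high_rate:
        return "rare but useful signal"
    return "low current value"


# (suggestion density, applied rate) per type; illustrative values only.
points = {
    "security": (0.20, 0.25),
    "bug_risk": (0.30, 0.50),
    "other_a": (0.90, 0.10),  # placeholder label
    "other_b": (0.05, 0.60),  # placeholder label
}

# Assumption: cut each axis at the median across types in the same period.
density_cutoff = median(d for d, _ in points.values())
rate_cutoff = median(r for _, r in points.values())

labels = {
    name: classify(d, r, density_cutoff, rate_cutoff)
    for name, (d, r) in points.items()
}
for name, label in labels.items():
    print(f"{name:10} {label}")
```

A median split only ranks types against each other, so it says nothing about whether any of them is good in absolute terms.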
Repository-level breakdown is not part of this release. That means threshold design should wait. Collect 4 to 12 weeks of Enterprise or Organization data first, then set local baselines.
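Once several weeks of data exist, a local baseline can be as simple as a band around the historical mean. The following is a rough sketch under stated assumptions: the weekly applied rates are invented, `baseline_band` and `is_outside_baseline` are hypothetical helpers, and the band width of two standard deviations is an arbitrary starting point rather than a recommendation from GitHub.

```python
from statistics import mean, stdev


def baseline_band(history: list[float], width: float = 2.0) -> tuple[float, float]:
    """Return (low, high) as mean +/- width * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)  # requires at least two data points
    return center - width * spread, center + width * spread


def is_outside_baseline(value: float, history: list[float], width: float = 2.0) -> bool:
    """True when a new weekly value falls outside the historical band."""
    low, high = baseline_band(history, width)
    return value < low or value > high


# Eight weeks of applied rate for one comment_type; illustrative values only.
weekly_applied_rate = [0.31, 0.29, 0.33, 0.30, 0.32, 0.28, 0.31, 0.30]

low, high = baseline_band(weekly_applied_rate)
print(f"baseline band: {low:.3f} to {high:.3f}")
print("0.30 outside baseline:", is_outside_baseline(0.30, weekly_applied_rate))
print("0.15 outside baseline:", is_outside_baseline(0.15, weekly_applied_rate))
```

The same check works for suggestion density. With only four weeks of history the band is wide and noisy, which is one reason to collect closer to twelve weeks before acting on it.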
Minimal Collection Script
The Copilot usage metrics REST API returns `download_links` for report files rather than the report body itself [2]. The minimal implementation is: fetch the latest 28-day report links, download the report files, then aggregate `pull_requests.copilot_suggestions_by_comment_type`.
```python
import json
import os
from collections import defaultdict

import requests

enterprise = os.environ["GITHUB_ENTERPRISE"]
token = os.environ["GITHUB_TOKEN"]
base = "https://api.github.com"
endpoint = f"{base}/enterprises/{enterprise}/copilot/metrics/reports/enterprise-28-day/latest"
headers = {
    "Authorization": f"Bearer {token}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2026-03-10",
}

# Step 1: fetch the download links for the latest 28-day report.
response = requests.get(endpoint, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly on auth or permission errors
links = response.json()["download_links"]


def load_records(text):
    """Parse a report file that is either a JSON document or NDJSON."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Multi-line NDJSON is not a single JSON document; parse line by line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else data.get("records", [data])


# Step 2: download each report file and sum suggestions per comment_type.
totals = defaultdict(lambda: {"posted": 0, "applied": 0})
for link in links:
    report = requests.get(link, timeout=30)
    report.raise_for_status()
    for row in load_records(report.text):
        prs = row.get("pull_requests") or {}
        for item in prs.get("copilot_suggestions_by_comment_type") or []:
            key = item["comment_type"]
            totals[key]["posted"] += item["total_copilot_suggestions"]
            totals[key]["applied"] += item["total_copilot_applied_suggestions"]

# Step 3: print the applied rate per comment_type.
for key, value in sorted(totals.items()):
    rate = value["applied"] / value["posted"] if value["posted"] else 0
    print(f"{key:15} posted={value['posted']:5d} applied={value['applied']:5d} rate={rate:.1%}")
```
The loader accepts either a JSON array or NDJSON, which makes the script less brittle when you wire it into an internal export job. In production, write the daily output to JSONL and send it to Grafana, Looker, Datadog, BigQuery, or your existing metrics pipeline.
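Writing the daily output to JSONL could look roughly like the sketch below. It assumes the `totals` shape built by the collection script; `append_snapshot`, the record field names, and the file name are hypothetical choices, and the example writes to a temporary directory so it has no side effects.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def append_snapshot(totals: dict, path: Path, day: date) -> int:
    """Append one JSON line per comment_type and return the number of lines written."""
    written = 0
    with path.open("a", encoding="utf-8") as handle:
        for comment_type, value in sorted(totals.items()):
            posted, applied = value["posted"], value["applied"]
            record = {
                "date": day.isoformat(),
                "comment_type": comment_type,
                "posted": posted,
                "applied": applied,
                "applied_rate": applied / posted if posted else 0.0,
            }
            handle.write(json.dumps(record) + "\n")
            written += 1
    return written


# Illustrative totals in the same shape the collection script builds.
sample_totals = {
    "security": {"posted": 40, "applied": 10},
    "bug_risk": {"posted": 60, "applied": 30},
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "copilot_review_metrics.jsonl"  # hypothetical file name
    count = append_snapshot(sample_totals, out_file, date(2026, 5, 8))
    lines = out_file.read_text(encoding="utf-8").splitlines()

print(count, "records written")
print(lines[0])
```

Because the report covers a rolling 28-day window, consecutive daily snapshots overlap. Keeping the snapshot date on every record lets downstream tools choose one snapshot per period instead of summing them.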
For permissions, the enterprise endpoint is available to enterprise owners, billing managers, or authorized users with the View Enterprise Copilot Metrics permission [2]. Use the corresponding organization endpoint if you want Organization-level reporting.
Common Pitfalls
A moving metric is not automatically an improvement. Three constraints should be visible in the dashboard notes.
First, comment_type is Copilot's label. A security suggestion may be security-related, but it is not a confirmed vulnerability or a severity score.
Second, applied rate has bias. Developers sometimes apply a minor suggestion to resolve a comment. They may also reject a correct suggestion for design reasons.
Third, scope is limited. This metric covers Copilot code review suggestions. It does not cover human review discussion, Copilot Chat conversations, or findings from external SAST tools.
For that reason, avoid a single executive score. Use suggestion density, applied rate, review duration, pull request throughput, and time to merge as a small decision set.
Summary
With comment_type-level Copilot code review metrics, AI review measurement can move from "did we enable it?" to "which types of suggestions are developers actually applying?"
The first question is not whether security or bug_risk is absolutely good. The first question is whether suggestion density and applied rate are stable against your own history.
Right after the release, the right move is not threshold design. Collect 4 to 12 weeks of baseline data, separate useful categories from noisy ones, and then turn Copilot code review from a helpful AI reviewer into a measurable review system.