Measuring AI Review Value with Copilot Code Review Metrics: KPI Design by security and bug_risk

Audience: Engineering teams measuring AI code review

Engineering managers, SREs, and platform teams that use GitHub Copilot code review and need a practical way to measure whether AI review is working.

Key Points

On May 8, 2026, GitHub added comment_type-level Copilot code review suggestion metrics.
Labels such as security and bug_risk are Copilot-assigned categories, not severity ratings. Treat them as review-signal buckets.
Repository-level breakdowns are not available yet, so the first step is an Enterprise or Organization baseline over 4 to 12 weeks.

What Changed

On May 8, 2026, GitHub added copilot_suggestions_by_comment_type to the Copilot usage metrics API.¹ This makes it possible to aggregate Copilot code review suggestions by comment_type.

Each entry has three values that matter for KPI design.

comment_type: the type label assigned by Copilot
total_copilot_suggestions: the number of Copilot suggestions posted for that type
total_copilot_applied_suggestions: the number of Copilot suggestions that developers applied
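As a rough illustration, one entry can be modeled as a small typed record. The three field names come from the report. The SuggestionEntry class, its from_report_item helper, and the sample numbers are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestionEntry:
    """One copilot_suggestions_by_comment_type entry (hypothetical wrapper)."""

    comment_type: str
    posted: int   # total_copilot_suggestions
    applied: int  # total_copilot_applied_suggestions

    @classmethod
    def from_report_item(cls, item: dict) -> "SuggestionEntry":
        # Assumption: a missing counter is treated as 0 so a sparse entry
        # does not crash the aggregation job.
        return cls(
            comment_type=item["comment_type"],
            posted=int(item.get("total_copilot_suggestions", 0)),
            applied=int(item.get("total_copilot_applied_suggestions", 0)),
        )


# Made-up sample values, for illustration only.
sample = {
    "comment_type": "security",
    "total_copilot_suggestions": 40,
    "total_copilot_applied_suggestions": 10,
}
entry = SuggestionEntry.from_report_item(sample)
print(entry)
```

Shortening the counter names to posted and applied is a readability choice for downstream code, not something the report itself does.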

This is not a new one-call dashboard that explains all review value. More precisely, Copilot usage metrics reports now include a new breakdown field for Copilot review suggestions.

The most important caveat is that security and bug_risk are not severity ratings. They are Copilot-assigned categories. In KPI work, read them first as "what kind of review signal Copilot produced" and "how often developers applied that signal."

Why This KPI Helps

AI review metrics often fail because aggregate counts are too vague. More suggestions can mean better coverage, but they can also mean more false positives.

Once the data is grouped by comment_type, better questions become possible.

Are security suggestions increasing while applied rate is falling?
Is bug_risk highly applied but too rare to matter?
Did one category spike after a policy or configuration change?
Did the mix of accepted suggestions change after a team changed its review rules?

The central metric is applied rate by type.

applied rate by type = total_copilot_applied_suggestions / total_copilot_suggestions

This works as a practical proxy for whether developers considered that suggestion type worth applying. It is not a perfect value metric. A developer may apply a trivial suggestion just to resolve a comment, and a valid suggestion may be rejected for architectural reasons. Pair it with review duration, follow-up comments, and rework signals.
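The first question in the list above (volume rising while applied rate falls) can be sketched as a comparison of two reporting periods. The input shape, the function names, and the numbers are assumptions made for this example. Only the applied rate formula comes from the text.

```python
def applied_rate(posted: int, applied: int) -> float:
    """applied rate by type = applied suggestions / posted suggestions."""
    return applied / posted if posted else 0.0


def compare_periods(previous: dict, current: dict) -> dict:
    """Return the per-type change in volume and applied rate between two periods."""
    empty = {"posted": 0, "applied": 0}
    changes = {}
    for comment_type in sorted(set(previous) | set(current)):
        prev = previous.get(comment_type, empty)
        curr = current.get(comment_type, empty)
        changes[comment_type] = {
            "posted_delta": curr["posted"] - prev["posted"],
            "rate_delta": applied_rate(curr["posted"], curr["applied"])
            - applied_rate(prev["posted"], prev["applied"]),
        }
    return changes


def volume_up_rate_down(changes: dict) -> list:
    """List the types whose suggestion count rose while applied rate fell."""
    return [
        comment_type
        for comment_type, change in changes.items()
        if change["posted_delta"] > 0 and change["rate_delta"] < 0
    ]


# Made-up numbers for two consecutive reporting periods.
previous = {"security": {"posted": 40, "applied": 20}, "bug_risk": {"posted": 10, "applied": 8}}
current = {"security": {"posted": 80, "applied": 20}, "bug_risk": {"posted": 12, "applied": 10}}
flagged = volume_up_rate_down(compare_periods(previous, current))
print(flagged)
```

A flagged type is a prompt to look closer, not a verdict. The bias described above still applies to every rate in the comparison.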

KPI Design by comment_type

Do not start with a complex composite score. A useful baseline needs only four metrics, each computed at Enterprise or Organization scope and broken down by comment_type.

Metric	Formula	Use
Suggestion density	Suggestions by type / reviewed PRs	Shows where Copilot reacts
Applied rate by type	Applied suggestions / suggestions	Shows how often developers adopt the signal
Suggestion mix	Suggestions by type / all suggestions	Shows the distribution of review output
Applied mix	Applied suggestions by type / all applied suggestions	Shows where accepted value concentrates
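The four formulas in the table can be computed together from per-type totals. This is a minimal sketch: the totals shape, the metric key names, and the figures are made up, and the reviewed PR count is an assumed input that has to come from wherever your organization already counts reviewed pull requests.

```python
def baseline_metrics(totals: dict, reviewed_prs: int) -> dict:
    """Compute the four baseline metrics per comment_type.

    totals: {comment_type: {"posted": int, "applied": int}}
    reviewed_prs: pull requests reviewed in the same period (assumed input).
    """
    all_posted = sum(value["posted"] for value in totals.values())
    all_applied = sum(value["applied"] for value in totals.values())

    def ratio(numerator: float, denominator: float) -> float:
        # A zero denominator yields 0.0 instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    return {
        comment_type: {
            "suggestion_density": ratio(value["posted"], reviewed_prs),
            "applied_rate": ratio(value["applied"], value["posted"]),
            "suggestion_mix": ratio(value["posted"], all_posted),
            "applied_mix": ratio(value["applied"], all_applied),
        }
        for comment_type, value in totals.items()
    }


# Made-up totals for illustration.
totals = {
    "security": {"posted": 50, "applied": 10},
    "bug_risk": {"posted": 150, "applied": 90},
}
metrics = baseline_metrics(totals, reviewed_prs=400)
for comment_type, values in sorted(metrics.items()):
    print(comment_type, {name: round(number, 3) for name, number in values.items()})
```

In this made-up sample, security accounts for a quarter of the suggestions but only a tenth of the applied ones, which is the kind of gap between suggestion mix and applied mix the table is meant to surface.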

The strongest view is a two-axis map: suggestion density and applied rate.

State	Interpretation	Next action
High density, high applied rate	Core category	Put it in the standard KPI set
High density, low applied rate	Possible false-positive cluster	Review paths, rules, and review triggers
Low density, high applied rate	Rare but useful signal	Expand coverage or target repositories
Low density, low applied rate	Low current value	Observe before optimizing
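One way to place each type on this map without committing to fixed thresholds is to split on the median across types. That cut-off is an assumption made for this sketch, not a recommended threshold, and the metric key names and sample values (including the two placeholder types) are hypothetical.

```python
from statistics import median

# Interpretations copied from the table above, keyed by (density, applied rate).
INTERPRETATION = {
    ("high", "high"): "Core category",
    ("high", "low"): "Possible false-positive cluster",
    ("low", "high"): "Rare but useful signal",
    ("low", "low"): "Low current value",
}


def classify(metrics: dict) -> dict:
    """Map each comment_type to a quadrant of the density / applied-rate map.

    Cut-offs are the medians across types: a relative placeholder only.
    """
    density_cut = median(m["suggestion_density"] for m in metrics.values())
    rate_cut = median(m["applied_rate"] for m in metrics.values())
    result = {}
    for comment_type, m in metrics.items():
        density = "high" if m["suggestion_density"] >= density_cut else "low"
        rate = "high" if m["applied_rate"] >= rate_cut else "low"
        result[comment_type] = INTERPRETATION[(density, rate)]
    return result


# Made-up values; placeholder_a and placeholder_b are not real comment_type labels.
sample_metrics = {
    "security": {"suggestion_density": 0.30, "applied_rate": 0.10},
    "bug_risk": {"suggestion_density": 0.05, "applied_rate": 0.70},
    "placeholder_a": {"suggestion_density": 0.40, "applied_rate": 0.60},
    "placeholder_b": {"suggestion_density": 0.02, "applied_rate": 0.05},
}
quadrants = classify(sample_metrics)
for comment_type, label in quadrants.items():
    print(f"{comment_type}: {label}")
```

Because the split is relative, roughly half the types always land on the high side of each axis. Treat the output as a way to order the review conversation, not as an absolute judgment.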

Repository-level breakdown is not part of this release. That means threshold design should wait. Collect 4 to 12 weeks of Enterprise or Organization data first, then set local baselines.
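Once the weeks are collected, a local baseline can be as simple as a mean with a tolerance band. The sketch below assumes one applied rate per week for a single comment_type. The two-standard-deviation band, the function names, and the sample values are illustrative assumptions; only the four-week minimum comes from the text.

```python
from statistics import mean, pstdev


def local_baseline(weekly_rates: list, min_weeks: int = 4) -> dict:
    """Summarize weekly applied rates into a mean and a tolerance band."""
    if len(weekly_rates) < min_weeks:
        raise ValueError(f"need at least {min_weeks} weeks, got {len(weekly_rates)}")
    center = mean(weekly_rates)
    spread = pstdev(weekly_rates)
    # Assumption: two standard deviations is an arbitrary starting width.
    return {"mean": center, "low": center - 2 * spread, "high": center + 2 * spread}


def is_outside(value: float, baseline: dict) -> bool:
    """True when a new weekly value falls outside the baseline band."""
    return value < baseline["low"] or value > baseline["high"]


# Made-up weekly applied rates for one comment_type.
weekly = [0.40, 0.44, 0.38, 0.42, 0.41, 0.45]
baseline = local_baseline(weekly)
print({name: round(number, 3) for name, number in baseline.items()})
print(is_outside(0.20, baseline), is_outside(0.42, baseline))
```

The same pattern applies to suggestion density. With only 4 to 12 data points the band is rough, so it suits "look closer" alerts better than pass or fail gates.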

Minimal Collection Script

The Copilot usage metrics REST API returns download_links for report files rather than the report body itself.² The minimal implementation is: fetch the latest 28-day report links, download the report files, then aggregate pull_requests.copilot_suggestions_by_comment_type.

import json
import os
from collections import defaultdict

import requests

# The enterprise slug and token are read from the environment.
enterprise = os.environ["GITHUB_ENTERPRISE"]
token = os.environ["GITHUB_TOKEN"]
base = "https://api.github.com"
endpoint = f"{base}/enterprises/{enterprise}/copilot/metrics/reports/enterprise-28-day/latest"
headers = {
    "Authorization": f"Bearer {token}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2026-03-10",
}

# Step 1: the API returns links to report files, not the report body.
response = requests.get(endpoint, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly on auth or permission errors
links = response.json()["download_links"]


def load_records(text):
    """Parse a report file that is a JSON array, a JSON object, or NDJSON."""
    try:
        data = json.loads(text)
        return data if isinstance(data, list) else data.get("records", [data])
    except json.JSONDecodeError:
        # Multi-line NDJSON is not valid JSON as a whole, so parse it line by line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


# Step 2: download each report file and sum the counts per comment_type.
totals = defaultdict(lambda: {"posted": 0, "applied": 0})
for link in links:
    report = requests.get(link, timeout=30)
    report.raise_for_status()
    for row in load_records(report.text):
        # "or" also covers rows where the field is present but null.
        prs = row.get("pull_requests") or {}
        for item in prs.get("copilot_suggestions_by_comment_type") or []:
            key = item["comment_type"]
            totals[key]["posted"] += item.get("total_copilot_suggestions", 0)
            totals[key]["applied"] += item.get("total_copilot_applied_suggestions", 0)

# Step 3: print the applied rate by type; a type with no suggestions reports 0.
for key, value in sorted(totals.items()):
    rate = value["applied"] / value["posted"] if value["posted"] else 0
    print(f"{key:15} posted={value['posted']:5d} applied={value['applied']:5d} rate={rate:.1%}")

The loader accepts a JSON array, a single JSON object (optionally wrapping its rows under a records key), or NDJSON, which makes the script less brittle when you wire it into an internal export job. In production, write the daily output to JSONL and send it to Grafana, Looker, Datadog, BigQuery, or your existing metrics pipeline.
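The JSONL step could look like the sketch below, which appends one line per comment_type. The append_daily_snapshot function, the output field names, and the file name are hypothetical. The totals argument is assumed to have the same shape as the totals dictionary in the script above, and the sample values are made up.

```python
import json
import tempfile
from datetime import date, datetime, timezone
from pathlib import Path


def append_daily_snapshot(totals: dict, path: Path, day: date) -> int:
    """Append one JSON line per comment_type and return the number of lines written."""
    collected_at = datetime.now(timezone.utc).isoformat()
    lines = []
    for comment_type, value in sorted(totals.items()):
        posted, applied = value["posted"], value["applied"]
        lines.append(json.dumps({
            "day": day.isoformat(),
            "collected_at": collected_at,
            "comment_type": comment_type,
            "posted": posted,
            "applied": applied,
            "applied_rate": applied / posted if posted else 0.0,
        }))
    # Append mode keeps earlier snapshots in place; each run adds new lines.
    with path.open("a", encoding="utf-8") as handle:
        handle.write("".join(line + "\n" for line in lines))
    return len(lines)


# Demo against a temporary directory so the example leaves no file behind.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "copilot_review_metrics.jsonl"
    sample_totals = {"security": {"posted": 40, "applied": 10}}
    written = append_daily_snapshot(sample_totals, out, date(2026, 5, 8))
    rows = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
print(written, rows[0]["comment_type"], rows[0]["applied_rate"])
```

If the job keeps pulling the 28-day report, consecutive snapshots presumably cover overlapping windows, so chart them as point-in-time snapshots rather than summing them across days.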

For permissions, the enterprise endpoint is available to enterprise owners, billing managers, or authorized users with the View Enterprise Copilot Metrics permission.² Use the corresponding organization endpoint if you want Organization-level reporting.

Common Pitfalls

A moving metric is not automatically an improvement. Three constraints should be visible in the dashboard notes.

First, comment_type is Copilot's label. A security suggestion may be security-related, but it is not a confirmed vulnerability or a severity score.

Second, applied rate has bias. Developers sometimes apply a minor suggestion to resolve a comment. They may also reject a correct suggestion for design reasons.

Third, scope is limited. This metric covers Copilot code review suggestions. It does not cover human review discussion, Copilot Chat conversations, or findings from external SAST tools.

For that reason, avoid a single executive score. Use suggestion density, applied rate, review duration, pull request throughput, and time to merge as a small decision set.

Summary

With comment_type-level Copilot code review metrics, AI review measurement can move from "did we enable it?" to "which types of suggestions are developers actually applying?"

The first question is not whether security or bug_risk is good in absolute terms. The first question is whether suggestion density and applied rate are stable against your own history.

Immediately after the release, threshold design is premature. Collect 4 to 12 weeks of baseline data, separate useful categories from noisy ones, and then turn Copilot code review from a helpful AI reviewer into a measurable review system.