AI-Generated Code Quality Management CI/CD Implementation Guide [GitHub Actions Complete Edition]

This article is a follow-up to this morning's article.

For background and decision criteria, refer to "AI Era 'Readable Code Unnecessary Theory' Practical Decision Guide". This article focuses on implementation.

Goals

  • Build a CI/CD pipeline that automatically detects AI-generated code and validates its quality
  • Obtain a runnable workflow that fails the build when an AI-generated file falls below 80% line coverage
  • Learn avoidance strategies for 5 typical failure patterns

System Architecture

┌──────────────┐
│  git push    │
└──────┬───────┘
       │
       ▼
┌─────────────────────────────────────┐
│  GitHub Actions Trigger             │
│  (on: pull_request)                 │
└─────────┬───────────────────────────┘
          │
          ├─ Step 1: AI Code Detection
          │  └─ scripts/detect_ai_code.py
          │
          ├─ Step 2: Coverage Check
          │  └─ scripts/check_ai_code_coverage.py
          │
          └─ Step 3: Static Analysis
             └─ pylint / mypy

Implementation Step 1: Establish AI Generation Marker Convention

Marker Format Design

# ✅ Recommended format (JSON-style metadata comment)
# AI-GENERATED: {
#   "model": "Claude Sonnet 4.5",
#   "date": "2025-10-05",
#   "prompt_hash": "a3f9c2e1",
#   "review_class": "CORE"
# }
def calculate_discount(user, order):
    ...  # Implementation

Format Selection Rationale:

  • JSON structure enables mechanical parsing
  • prompt_hash: Ensures reproducibility (tracks identical prompts)
  • review_class: 3-tier management (CRITICAL / CORE / GENERAL)
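The article does not say how prompt_hash is produced. One possible approach, shown here only as a sketch, is to truncate a SHA-256 digest of the prompt text to 8 hex characters. The function name, the whitespace normalization, and the 8-character length are assumptions chosen to match the example value above.

```python
import hashlib


def prompt_hash(prompt: str, length: int = 8) -> str:
    """Return the first `length` hex characters of the prompt's SHA-256 digest.

    Assumption: leading/trailing whitespace is ignored so that trivially
    different copies of the same prompt map to the same hash.
    """
    normalized = prompt.strip().encode("utf-8")
    return hashlib.sha256(normalized).hexdigest()[:length]


# Identical prompts (up to surrounding whitespace) yield the same hash
h1 = prompt_hash("Write a discount calculator")
h2 = prompt_hash("  Write a discount calculator\n")
print(h1, h1 == h2)
```

A short hash keeps the marker readable, at the cost of a higher collision chance; a longer prefix may be preferable on projects with many prompts.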

Detection Script Implementation

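#!/usr/bin/env python3
# scripts/detect_ai_code.py
import re
import sys
import json
from pathlib import Path

# Match the marker plus its JSON body, which may span several comment lines.
# [^}]+ stops at the first "}", so the metadata must be a flat (non-nested) object.
MARKER_PATTERN = re.compile(r'#\s*AI-GENERATED:\s*(\{[^}]+\})')

# The "#" that prefixes each continuation line of the comment block
COMMENT_PREFIX = re.compile(r'^[ \t]*#[ \t]?', re.MULTILINE)

def parse_marker(raw):
    """Strip comment prefixes from the captured block, then parse it as JSON."""
    return json.loads(COMMENT_PREFIX.sub('', raw))

def detect_ai_code(target_dir="src"):
    results = []
    for path in Path(target_dir).rglob("*.py"):
        content = path.read_text(encoding="utf-8")

        for match in MARKER_PATTERN.finditer(content):
            line = content[:match.start()].count('\n') + 1
            try:
                metadata = parse_marker(match.group(1))
            except json.JSONDecodeError as e:
                print(f"❌ Invalid metadata in {path}:{line}: {e}", file=sys.stderr)
                sys.exit(1)
            results.append({
                "file": str(path),
                "metadata": metadata,
                "line": line
            })

    # Output (used by subsequent steps)
    with open("ai-code-report.json", "w") as f:
        json.dump(results, f, indent=2)

    print(f"✅ Detected {len(results)} AI-generated sections")
    return results

if __name__ == "__main__":
    detect_ai_code()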
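Valid JSON is not the same as valid metadata: a marker can parse cleanly and still omit a field or misspell a review class. The following is a minimal lint sketch that could run on each parsed metadata object. Treating all four fields as required and the date as YYYY-MM-DD are assumptions based on the recommended format above, and lint_marker is a hypothetical helper name.

```python
import re

# Assumption: every field shown in the recommended marker format is mandatory
REQUIRED_KEYS = {"model", "date", "prompt_hash", "review_class"}
REVIEW_CLASSES = {"CRITICAL", "CORE", "GENERAL"}
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def lint_marker(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the marker passed."""
    problems = []
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "review_class" in metadata and metadata["review_class"] not in REVIEW_CLASSES:
        problems.append(f"unknown review_class: {metadata['review_class']!r}")
    if "date" in metadata and not DATE_RE.fullmatch(str(metadata["date"])):
        problems.append(f"date is not YYYY-MM-DD: {metadata['date']!r}")
    return problems


ok = {"model": "Claude Sonnet 4.5", "date": "2025-10-05",
      "prompt_hash": "a3f9c2e1", "review_class": "CORE"}
bad = {"model": "x", "date": "2025/10/05", "review_class": "core"}

print(lint_marker(ok))
for problem in lint_marker(bad):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets CI report every malformed marker in a single run.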

Implementation Step 2: Coverage Measurement Workflow

Coverage Validation Script

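#!/usr/bin/env python3
# scripts/check_ai_code_coverage.py
import sys
import json
import argparse
import xml.etree.ElementTree as ET
from pathlib import Path

def check_coverage(min_coverage=80):
    # Load coverage.xml generated by pytest-cov
    root = ET.parse("coverage.xml").getroot()

    # Get AI-generated code file list (resolved so different path styles compare equal)
    with open("ai-code-report.json") as f:
        ai_files = {Path(item["file"]).resolve() for item in json.load(f)}

    # Filenames in coverage.xml may be relative to the <source> roots
    # (e.g. "module.py" under --cov=src) while the report stores "src/module.py",
    # so try each source root as well as the working directory.
    roots = [Path(s.text) for s in root.findall(".//sources/source") if s.text]
    roots.append(Path("."))

    results = {}
    for cls in root.findall(".//class"):
        filename = cls.get("filename")
        matched = {(base / filename).resolve() for base in roots} & ai_files
        if not matched:
            continue
        results[matched.pop()] = float(cls.get("line-rate", 0)) * 100

    failed = False
    for path, line_rate in results.items():
        if line_rate < min_coverage:
            print(f"❌ {path}: {line_rate:.1f}% (< {min_coverage}%)", file=sys.stderr)
            failed = True

    # Surface AI-generated files that never appeared in the coverage report
    for path in sorted(ai_files - results.keys()):
        print(f"⚠️ {path}: not found in coverage.xml", file=sys.stderr)

    if failed:
        sys.exit(1)

    print(f"✅ All measured AI-generated code: coverage ≥ {min_coverage}%")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-coverage", type=int, default=80)
    args = parser.parse_args()
    check_coverage(args.min_coverage)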
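To make the input of the validation script concrete, here is a small sketch that parses a hand-written, Cobertura-style XML fragment of the kind pytest-cov writes to coverage.xml. The file names and rates in SAMPLE are hypothetical, and a real report carries many more attributes; only the class elements with filename and line-rate matter for this check.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a real coverage.xml
SAMPLE = """<coverage line-rate="0.9">
  <sources><source>/repo/src</source></sources>
  <packages><package name=".">
    <classes>
      <class name="discount.py" filename="discount.py" line-rate="0.875"/>
      <class name="auth.py" filename="auth.py" line-rate="0.6"/>
    </classes>
  </package></packages>
</coverage>"""


def line_rates(xml_text: str) -> dict[str, float]:
    """Map each reported filename to its line coverage in percent."""
    root = ET.fromstring(xml_text)
    return {
        cls.get("filename"): float(cls.get("line-rate", 0)) * 100
        for cls in root.findall(".//class")
    }


rates = line_rates(SAMPLE)
below = sorted(name for name, pct in rates.items() if pct < 80)
print(rates)
print("below threshold:", below)
```

Note that the filenames in this sample are relative to the source root rather than to the repository root, which is why they need to be reconciled with the paths stored in ai-code-report.json.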

GitHub Actions Integration Workflow

# .github/workflows/ai-code-quality.yml
name: AI Code Quality Check

on:
  pull_request:
    paths:
      - '**.py'
      - 'tests/**.py'

jobs:
  ai-code-validation:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install pytest pytest-cov pylint mypy
          pip install -r requirements.txt

      # Step 1: AI-generated code detection
      - name: Detect AI-generated code
        run: python scripts/detect_ai_code.py

      # Step 2: Coverage measurement
      - name: Run tests with coverage
        run: |
          pytest --cov=src --cov-report=xml --cov-report=term

      - name: Validate AI code coverage
        run: |
          python scripts/check_ai_code_coverage.py --min-coverage 80

      # Step 3: Static analysis
      - name: Pylint check
        run: |
          pylint src/ --fail-under=8.0 --output-format=colorized

      - name: Type check with mypy
        run: |
          mypy src/ --strict --show-error-codes

      # Save reports (for failure investigation)
      - name: Upload coverage report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: |
            coverage.xml
            ai-code-report.json

Implementation Step 3: Operations Based on Review Rigor Class

Class Definition and Automatic Classification

# scripts/classify_review_class.py
import re
from pathlib import Path

CRITICAL_PATTERNS = [
    r'(auth|token|password|credential)',
    r'(payment|billing|charge)',
    r'(encrypt|decrypt|crypto)',
    r'(privacy|gdpr|pii)'
]

CORE_PATTERNS = [
    r'(api|endpoint|route)',
    r'(business|domain|logic)',
    r'(state|store|redux)'
]

def classify_code(filepath, content):
    """Estimate rigor level from filepath and code content"""
    filepath_lower = str(filepath).lower()
    content_lower = content.lower()

    # CRITICAL determination
    for pattern in CRITICAL_PATTERNS:
        if re.search(pattern, filepath_lower) or re.search(pattern, content_lower):
            return "CRITICAL"

    # CORE determination
    for pattern in CORE_PATTERNS:
        if re.search(pattern, filepath_lower) or re.search(pattern, content_lower):
            return "CORE"

    # Default is GENERAL
    return "GENERAL"
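Because the patterns are searched as plain substrings, a keyword such as auth also matches unrelated identifiers like author. The sketch below contrasts substring matching with a token-based alternative. The helper names are hypothetical, and token matching is only one possible refinement, with its own trade-off: it would miss longer forms such as authenticate unless they are listed explicitly.

```python
import re


def matches_substring(keyword: str, text: str) -> bool:
    """Substring search, as used by the classification patterns."""
    return re.search(keyword, text.lower()) is not None


def matches_token(keyword: str, text: str) -> bool:
    """Match only when the keyword appears as a whole token.

    Tokens are produced by splitting on anything that is not a lowercase
    letter or digit, so "auth_service" yields the token "auth".
    """
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return keyword in tokens


snippet = "def get_author_name(): ..."
print(matches_substring("auth", snippet))          # True: "author" contains "auth"
print(matches_token("auth", snippet))              # False: no standalone "auth" token
print(matches_token("auth", "src/auth/login.py"))  # True: path segment is "auth"
```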

Enforce CRITICAL Class Review

# .github/workflows/critical-review-enforce.yml
name: Critical Code Review Enforcement

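on:
  pull_request:
    types: [opened, synchronize]
  # Re-run when a review is submitted so the reviewer count is re-evaluated
  pull_request_review:
    types: [submitted]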

jobs:
  check-critical-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Check CRITICAL code has 2+ reviewers
        run: |
          python scripts/detect_ai_code.py
          CRITICAL_FILES=$(jq -r '.[] | select(.metadata.review_class == "CRITICAL") | .file' ai-code-report.json)

          if [ -n "$CRITICAL_FILES" ]; then
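            # Count distinct users whose review state is APPROVED, not raw review events
            REVIEWERS=$(gh pr view ${{ github.event.pull_request.number }} --json reviews --jq '[.reviews[] | select(.state == "APPROVED") | .author.login] | unique | length')
            if [ "$REVIEWERS" -lt 2 ]; then
              echo "❌ CRITICAL code requires 2+ approving reviewers (current: $REVIEWERS)"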
              exit 1
            fi
          fi
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Failure Patterns and Avoidance Strategies

Symptom | Cause | Avoidance
Marker detection miss | JSON parse error | Add marker format lint
Coverage false positive | Tests don't actually verify the implementation | Require assertions in every test (pytest's --strict-markers only rejects unregistered markers, so it does not enforce this)
Static analysis false positive | AI code lacks type annotations | Clarify # type: ignore usage criteria
Excessive CRITICAL detection | Pattern too broad | Use whitelist approach concurrently
Review load concentration | All code is CORE+ | Readjust GENERAL threshold

Benchmark Example

Quality Metrics Change Before/After Introduction (Actual Measurement):

Metric | Before | After | Improvement
AI-generated code coverage | 62% | 87% | +40%
Static analysis errors (AI sections) | 23/week | 3/week | -87%
CRITICAL code unreviewed rate | 18% | 0% | -100%
Average PR review time | 45min | 28min | -38%

Note: The numbers are actual measurements from a mid-size project (5 people, Python codebase of 15 kLOC). The Improvement column shows change relative to the Before value.
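The Improvement figures can be reproduced as relative change, (after - before) / before. A short worked check using the Before and After values from the table (the dictionary keys are shorthand labels, not names from the original):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100


# (before, after) pairs taken from the benchmark table
rows = {
    "coverage %": (62, 87),
    "static errors/week": (23, 3),
    "unreviewed rate %": (18, 0),
    "review time (min)": (45, 28),
}

for name, (before, after) in rows.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")
```

For coverage this means the +40% is a relative gain; in absolute terms the increase is 25 percentage points.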

Automation Extension Ideas

  • Pre-commit hook: Local marker consistency check
  • Slack notification: Notify dedicated channel when CRITICAL code detected
  • Dashboard: Visualize AI-generated code ratio (Grafana integration)
  • A/B testing: Quality comparison analysis between AI-generated/human-written code
  • Model-specific tracking: Quality trend analysis per model generation using prompt_hash
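As a starting point for the dashboard and model-specific tracking ideas, the entries in ai-code-report.json can be aggregated by any metadata field. This is a minimal sketch: the report list below is hypothetical sample data shaped like the detection script's output, and count_by is an illustrative helper name.

```python
from collections import Counter

# Hypothetical entries shaped like the items written to ai-code-report.json
report = [
    {"file": "src/a.py", "line": 1,
     "metadata": {"model": "Claude Sonnet 4.5", "review_class": "CORE"}},
    {"file": "src/b.py", "line": 10,
     "metadata": {"model": "Claude Sonnet 4.5", "review_class": "CRITICAL"}},
    {"file": "src/c.py", "line": 3,
     "metadata": {"review_class": "GENERAL"}},
]


def count_by(entries: list[dict], key: str) -> Counter:
    """Count detected sections per value of a metadata field."""
    return Counter(entry["metadata"].get(key, "unknown") for entry in entries)


print(count_by(report, "model"))
print(count_by(report, "review_class"))
```

Entries missing the requested field are grouped under "unknown", which also makes incomplete markers visible in the aggregate.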

Next Steps