Scrum × AI Coding: A Practical Guide to Spec Management in Agile Environments¶
When AI coding is integrated into Scrum, the traditional "Story + AC + Review" approach on its own makes spec drift more likely. This guide covers concrete steps for updating User Stories, Acceptance Criteria (AC), the Definition of Done (DoD), and review practices for teams using Claude Code / Copilot / Cursor.
Related Guides¶
- Spec vs Code vs Prompt: Comparing Documentation Strategies
- Claude Code Auto-Permission Guide (Operational Design)
- Claude Code Installation & Environment Setup
- GitHub Copilot Custom Instructions in Practice
- Codex CLI Approval Modes: Safe Usage Guide
- Claude Code Project Management Best Practices
The "Minimal Spec" Approach in Scrum, and Its Broken Assumptions¶
Many Scrum teams reduce specs to the minimum information carried on a Product Backlog Item (PBI). User Stories follow the "As a ~, I want ~, so that ~" format, and Acceptance Criteria (AC) are scenario-based. The Agile Manifesto's "working software" spirit translates into "move forward with dialogue rather than thick spec documents."
When only humans were developing, this worked. Developers knew the team's context and could implicitly fill in gaps like "when this PO says 'simple,' it usually means about this much." If something was missing, a quick verbal check sufficed.
When AI coding enters the picture, this assumption breaks down.
The Gap AI Coding Introduces¶
The fundamental risk with AI is not that it ignores ambiguity. It is that it fills in the ambiguous parts "nicely" on its own.
Typical Misalignment Patterns¶
1. AI doesn't know the team's tacit knowledge
Story: "Display recently added items at the top of the list screen"
The team's shared understanding is "recent = past 30 days" and "sort priority varies by item type." AI knows none of this and settles on whichever interpretation is most convenient for it.
2. AI rewrites specs with "best practices"
AC: "Limit access to single sign-on users"
The spec clearly states "use Company A's IdP SAML," but AI judges "OIDC is more modern" and tries to implement with OIDC assumptions.
3. Context gets dropped
When multiple stories are passed at once, AI drops some of the assumptions. Each PBI may be internally consistent, but cross-cutting constraints get lost. Where a human would remember "we decided this in the previous story," AI does not carry that decision forward.
Project-level context management
Mechanisms like Claude Code's CLAUDE.md and Cursor Rules for maintaining project-level context are emerging. However, few teams currently operate these effectively, and the tacit knowledge gap remains.
Reviews can't catch everything either¶
PR reviewers are already under pressure. AI-generated code increases volume per PR, and reviewers are stretched thin with "logic + design + security + style." There's no bandwidth left for "wait, this doesn't match the spec."
The result: "Tests pass, code looks fine. But something feels off" — and spec misalignment surfaces after release.
Restructuring AC: What / Rule / How¶
Once AI is assumed to be part of development, the first effective measure is to change the granularity and structure of AC.
Not "write verbosely" — identify where AI will get lost¶
For example, don't just write "display recently added items at the top." Add:
- Definition of "recent" (past N days, timezone)
- Sort priority (`created_at DESC` or update time?)
- Pagination or limits (up to 100 items, etc.)
Trying to specify everything will fail, so the key is to focus on only the parts AI might arbitrarily decide.
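As a minimal sketch of what the pinned-down version of this story could look like in code: the 30-day window, the 100-item cap, UTC timestamps, and sorting by `created_at` descending are assumptions taken from the examples above, and `Item` and `recent_items` are hypothetical names. The point is that each constant corresponds to one line of AC rather than to a guess.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed values for illustration; the real numbers belong in the AC.
RECENT_DAYS = 30
MAX_ITEMS = 100


@dataclass(frozen=True)
class Item:
    name: str
    created_at: datetime  # timezone-aware, assumed to be stored in UTC


def recent_items(items: list[Item], now: datetime) -> list[Item]:
    """Return items created within RECENT_DAYS, newest first, capped at MAX_ITEMS."""
    cutoff = now - timedelta(days=RECENT_DAYS)
    recent = [item for item in items if item.created_at >= cutoff]
    recent.sort(key=lambda item: item.created_at, reverse=True)
    return recent[:MAX_ITEMS]


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
items = [
    Item("old", now - timedelta(days=45)),
    Item("a", now - timedelta(days=2)),
    Item("b", now - timedelta(days=1)),
]
print([item.name for item in recent_items(items, now)])  # → ['b', 'a']
```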
The 3 AC categories¶
| Type | Content | Examples |
|---|---|---|
| AC-What | User-visible behavior | "Can do ~" "~ is displayed" |
| AC-Rule | Business rules and constraints | Pricing logic, permission control, SLA |
| AC-How(critical) | Critical implementation constraints | Data sources, encryption methods, cache usage |
AC-How(critical) includes only "parts that are dangerous to leave to best practices."
Guidelines for what to write:
- What: "Parts where AI might make a different choice" (when ambiguous expressions exist)
- Rule: "Parts where mistakes have business impact" (calculation logic, permissions, pricing)
- How: "Parts where AI changes would be dangerous" (security, compliance, performance)
Developers make the first call on what needs to be written, and the PO and SM review it from a risk perspective.
Jira / Azure Boards example: Label as AC-W1, AC-R1, AC-H1 under ### Acceptance Criteria in the Description. Or create custom fields with checklists by type.
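If the labels are written consistently, they can also be read by tooling. The following is a rough sketch, assuming each criterion is a bullet of the form `- AC-W1: text` in the Description; that line format, the sample criteria, and the function name `parse_acceptance_criteria` are illustrative assumptions, not a Jira or Azure Boards feature.

```python
import re
from collections import defaultdict

# Label prefixes follow the AC-W / AC-R / AC-H convention described above.
CATEGORY_BY_PREFIX = {"W": "What", "R": "Rule", "H": "How(critical)"}

# Assumed line format: "- AC-W1: some criterion text"
AC_LINE = re.compile(r"^\s*-\s*(AC-([WRH])\d+):\s*(.+)$")


def parse_acceptance_criteria(description: str) -> dict[str, list[tuple[str, str]]]:
    """Group labeled AC lines from a PBI description by category."""
    grouped: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for line in description.splitlines():
        match = AC_LINE.match(line)
        if match:
            label, prefix, text = match.groups()
            grouped[CATEGORY_BY_PREFIX[prefix]].append((label, text.strip()))
    return dict(grouped)


description = """\
### Acceptance Criteria
- AC-W1: Items added in the past 30 days appear at the top of the list
- AC-R1: Sort priority varies by item type
- AC-H1: Authenticate through Company A's IdP using SAML
"""
parsed = parse_acceptance_criteria(description)
print(sorted(parsed))  # → ['How(critical)', 'Rule', 'What']
```

Grouping by category makes it easy to pass only AC-H items to a stricter check, or to notice a PBI that has no AC-Rule entries at all.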
Should How be in AC or separated?¶
Parts directly tied to security, compliance, or performance go in AC-H on the PBI. Implementation preference-level How goes in ADR or design notes, not AC. The practical approach is to include only "what you absolutely don't want AI to do" in AC-H.
On PBI granularity¶
As AI coding increases implementation speed, larger PBIs become acceptable. But getting AI to build correctly requires sufficiently detailed AC. In other words, "effort to split PBIs small" is replaced by "effort to write AC in detail." The investment has moved rather than disappeared, and planning should account for that.
Updating DoD and Reviews¶
Add "Spec–Test–Code consistency" to DoD¶
Add this line to the existing DoD:
"Spec-Test-Code consistency check for PBI AC has been performed"
Specifically, either the developer verifies AC–Test–Code alignment, or an AI reviews whether "there are any contradictions between this AC and the test/code."
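One lightweight way to support the developer-side check is to see which AC IDs are mentioned anywhere in the test code. The sketch below assumes the team tags tests with AC IDs in comments or names (a convention this guide does not prescribe), and `uncovered_criteria` is a hypothetical helper. A mention is only a weak proxy: it shows that someone linked a test to the criterion, not that the test actually verifies it.

```python
import re

AC_ID = re.compile(r"\bAC-[WRH]\d+\b")


def uncovered_criteria(ac_ids: list[str], test_source: str) -> list[str]:
    """Return AC IDs that no test mentions, preserving the input order."""
    mentioned = set(AC_ID.findall(test_source))
    return [ac_id for ac_id in ac_ids if ac_id not in mentioned]


test_source = '''
def test_recent_items_sorted_first():  # AC-W1
    ...

def test_sort_priority_by_item_type():  # AC-R1
    ...
'''
print(uncovered_criteria(["AC-W1", "AC-R1", "AC-H1"], test_source))  # → ['AC-H1']
```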
AI review check before PR¶
If the team already codes with AI, using AI review as a self-check is a rational step. The developer passes the PBI Story + AC and the change diff to AI, which flags inconsistencies or gaps. Human reviewers then focus on design, quality, and risk.
Prompt example:
Against the following User Story and Acceptance Criteria,
check if the changed code has any inconsistencies or omissions.
【Check Points】
- Behavior in AC but not implemented
- Behavior implemented but not in AC
- Contradictions with AC-Rule (business rules)
- Violations of AC-How(critical) (implementation constraints)
【Output Format】
If inconsistencies exist:
- [AC-W1] Implementation gap: ~
- [AC-R2] Contradiction: ~ (implemented as ~)
If no issues: "✅ No issues"
【User Story】
{Story text}
【Acceptance Criteria】
{List of AC-W, AC-R, AC-H}
【Change Diff】
{Diff}
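Filling the placeholders by hand gets tedious quickly, so a small helper can assemble the prompt. This is a sketch under assumptions: `build_self_check_prompt` is a hypothetical function, `instructions` is the fixed part of the prompt above (the check points and output format), and how the diff is obtained and how the prompt is sent to a model are left out on purpose.

```python
def build_self_check_prompt(
    instructions: str, story: str, criteria: list[str], diff: str
) -> str:
    """Assemble the self-check prompt for one PBI from its story, AC list, and diff."""
    sections = [
        instructions.strip(),
        "【User Story】\n" + story.strip(),
        "【Acceptance Criteria】\n" + "\n".join(f"- {item}" for item in criteria),
        "【Change Diff】\n" + diff.strip(),
    ]
    return "\n\n".join(sections) + "\n"


prompt = build_self_check_prompt(
    instructions="Against the following User Story and Acceptance Criteria, ...",
    story="As a user, I want recently added items at the top of the list.",
    criteria=["AC-W1: Items from the past 30 days appear first"],
    diff="+    items.sort(key=lambda i: i.created_at, reverse=True)",
)
print(prompt)
```

Keeping the fixed instructions separate from the per-PBI inputs means the wording of the check can be tuned in one place without touching the code that collects stories and diffs.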
Execution timing:
- Recommended: Manual execution before local commit
- Intermediate: Auto-execute on PR creation via GitHub Actions, post results as comment
- Advanced: Integrate into pre-commit hook
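For the automated options, the hook or CI job has to turn the reviewer's reply into a pass/fail signal. The following sketch assumes the reply follows the output format in the prompt above; `parse_findings` and `exit_code` are hypothetical names, and the three exit codes are a design choice for illustration, not a convention of any particular tool.

```python
import re

# Matches lines such as "- [AC-R2] Contradiction: ~ (implemented as ~)"
FINDING = re.compile(r"^\s*-\s*\[(AC-[WRH]\d+)\]\s*([^:]+):\s*(.+)$")


def parse_findings(review_output: str) -> list[tuple[str, str, str]]:
    """Extract (AC ID, kind, detail) tuples from the reviewer's reply."""
    findings = []
    for line in review_output.splitlines():
        match = FINDING.match(line)
        if match:
            ac_id, kind, detail = match.groups()
            findings.append((ac_id, kind.strip(), detail.strip()))
    return findings


def exit_code(review_output: str) -> int:
    """0 = explicit 'No issues', 1 = findings reported, 2 = reply not understood."""
    if parse_findings(review_output):
        return 1
    return 0 if "No issues" in review_output else 2


reply = "- [AC-R2] Contradiction: tax is rounded up (implemented as round-half-even)"
print(parse_findings(reply))
print(exit_code(reply), exit_code("✅ No issues"), exit_code("Looks fine to me"))  # → 1 0 2
```

Treating an unrecognized reply as its own case (2) avoids silently passing a check when the model ignored the requested format.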
Limitations of AI self-check
AI self-check is not a panacea and is still a developing technique. Don't over-rely on it; use it in combination with human review.
Separate review perspectives into 3 layers¶
| Layer | Perspective | Check content |
|---|---|---|
| What | PO-side | Are Story and AC-What / AC-Rule satisfied? |
| Interface | External impact | Do API / event / DB schema changes contradict other components? |
| How | Developer | Patterns, testability, performance, security |
AI self-check covers What/Rule and Code consistency; human review focuses on Interface/How. The Interface layer requires special attention since AI changes can significantly impact other teams.
Interface layer detection: openapi-diff for API spec changes, buf breaking for Protobuf backward compatibility, migration file diff reviews, etc. If tools aren't available, note in AC-H "separate review required for API contract changes."
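Before any of those tools run, it helps to know whether a change touches the Interface layer at all. Here is a minimal sketch that classifies changed file paths; the glob patterns and the `interface_changes` name are hypothetical and would need adjusting to the repository's actual layout.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust to the repository's actual layout.
INTERFACE_PATTERNS = {
    "API spec": ["*openapi*.yaml", "*openapi*.yml", "*openapi*.json"],
    "Protobuf": ["*.proto"],
    "DB migration": ["*migrations/*"],
}


def interface_changes(changed_files: list[str]) -> dict[str, list[str]]:
    """Map each interface kind to the changed files that match its patterns."""
    hits: dict[str, list[str]] = {}
    for path in changed_files:
        for kind, patterns in INTERFACE_PATTERNS.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                hits.setdefault(kind, []).append(path)
    return hits


changed = ["src/app.py", "api/openapi.yaml", "db/migrations/0042_add_index.sql"]
print(interface_changes(changed))
# → {'API spec': ['api/openapi.yaml'], 'DB migration': ['db/migrations/0042_add_index.sql']}
```

A non-empty result could be used to request an Interface-layer reviewer or to trigger the heavier contract checks only when they are relevant.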
Spec Drift Management¶
"Actually, let's do it this way" emerging during implementation is unavoidable. As AI speeds up implementation, specs get overtaken more often. Rather than prohibiting this, set rules to detect and absorb it.
3 levels of implementation changes¶
| Level | Content | Response |
|---|---|---|
| 1 | Minor judgment (within AC interpretation) e.g.: wording tweaks, UI ordering | Note in PBI comment, PO verifies later |
| 2 | Affects business logic (small impact) e.g.: validation conditions added, calculation logic adjusted | AC-Rule update required before PR merge |
| 3 | Affects external contracts or pricing e.g.: API contract change, billing logic change | Split PBI, agree with PO in advance as new story |
Just incorporating "Level 2+ requires AC update before merge" into DoD significantly reduces spec-implementation gaps.
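The table above can be read as a small decision rule. The sketch below encodes it under the assumption that someone (the developer or a reviewer) has already classified the change; `DriftLevel` and `merge_decision` are hypothetical names, and the returned messages are illustrative.

```python
from enum import IntEnum


class DriftLevel(IntEnum):
    MINOR = 1              # within AC interpretation
    BUSINESS_LOGIC = 2     # affects business logic, small impact
    EXTERNAL_CONTRACT = 3  # affects external contracts or pricing


def merge_decision(level: DriftLevel, ac_updated: bool) -> tuple[bool, str]:
    """Return (merge allowed?, required action) for a classified change."""
    if level == DriftLevel.MINOR:
        return True, "Note the change in a PBI comment; PO verifies later"
    if level == DriftLevel.BUSINESS_LOGIC:
        if ac_updated:
            return True, "AC-Rule already updated"
        return False, "Update AC-Rule before merging"
    return False, "Split into a new PBI and agree with the PO in advance"


print(merge_decision(DriftLevel.BUSINESS_LOGIC, ac_updated=False))
# → (False, 'Update AC-Rule before merging')
```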
Include spec updates in Sprint Review¶
In Sprint Review, lightly review updated specs (AC / schema / ADR changes) alongside implemented features. No need to check everything in detail — sharing "top 3 points where specs changed" is sufficient.
Changes in Sprint Operations¶
Shortening Sprint duration as an option¶
Two-week Sprints were common because implementation took time and one week couldn't produce meaningful increments. If AI increases implementation speed, one-week Sprints become realistic.
Feedback cycles get faster and "built it but it was wrong" is discovered sooner. However, AC preparation and review process efficiency need to come first. If review load and spec drift detection still take time despite faster implementation, shortening backfires. This is best viewed as "an option that's now available" rather than "something you should do."
Estimation importance may decrease¶
Estimation was important because implementation was the bottleneck. When AI increases implementation speed, the bottleneck may shift to validating "what to build" and "whether what was built is correct."
However, AI output quality varies widely, and many teams see no reduction in implementation time when including rework. Whether estimation becomes unnecessary depends on the team's AI adoption maturity. Redirecting estimation time to AC detailing and Sprint Goal clarification is a viable approach.
Shift in time usage¶
| Activity | Traditional | After AI introduction |
|---|---|---|
| Implementation/Coding | Much | Decreases |
| Estimation/Planning | Some | Decreasing trend |
| AC/Spec detailing | Less | Increases |
| Review/Quality checking | Some | Increases |
| Hypothesis validation/Feedback | Less | Increases |
Investment is shifting from "implementation" to "spec clarification" and "validation."
How to Introduce to Teams¶
Suddenly declaring "we're doing Spec–Test–Code consistency checks on all stories" will almost certainly meet resistance.
Messages by role¶
For POs: The goal isn't making specs heavier — it's reducing gaps between expectations and implementation. Rework decreases, and as AI speeds up implementation, judging "what to build" becomes more important.
For SMs: If review bottlenecks stem from spec ambiguity, reviewing Story / AC / DoD is process improvement itself. Frame it as "organize review pain points upfront" rather than "increase documentation."
For Developers: This becomes a bulwark against AI arbitrarily rewriting specs. Setting up AC / DoD / AI self-checks means you're less likely to be labeled "the person fooled by AI into writing bad code." It functions as a self-defense line.
Start with minimum patterns¶
| Pattern | Content |
|---|---|
| High-risk only | Limit to billing / permissions / external API integration; try AC split + AI self-check |
| New features only | Don't chase existing modifications; apply from new epics |
| AI self-check only | Add one line to DoD: "AI-based Spec–Code check performed" |
| Try Sprint shortening | Try 1-week Sprints for 1–2 Sprints; validate if AC preparation keeps up |
Introducing the changes gradually limits the "the process got heavy" backlash and makes them easier to accept as guardrails for safe AI adoption.
Summary¶
AI coding puts strain on Scrum's "minimal spec + absorb through dialogue" model. There are four ways to respond.
- Restructure AC — Decompose into What / Rule / How(critical), explicitly stating only parts AI might arbitrarily decide
- Add Spec–Code consistency check to DoD — Insert AI self-check before PR, reducing human review load
- Manage spec drift — Level-classify implementation changes; Level 2+ requires AC update before merge
- Review Sprint operations — Consider duration shortening, estimation lightening, time allocation shifts
Starting gradually from high-risk stories or new features avoids a heavy process.
AI coding increases implementation speed while making spec-implementation gaps harder to detect. Updating Story / AC / DoD / Sprint operations is worth considering as a guardrail against that gap.
What doesn't change
Human accountability, Sprint Goal-centered operations, and Scrum's foundational values of transparency, inspection, and adaptation remain unchanged even in the AI era.