Lessons from Amazon's Outage: What Organizations Moving Fast with AI Were Missing

Audience / Key Points

For: Engineering leaders, SREs, and DevOps engineers at organizations that have adopted or are evaluating AI coding tools. Practitioners involved in change management and CI/CD pipeline design.

Key Points:

  • The outage is not about "AI broke things" — it's about governance failing to keep pace with accelerating deployment velocity
  • Amazon explicitly separated short-term controlled friction from long-term deterministic / agentic safeguards
  • Five areas worth auditing in your own organization emerge from the incident reporting and Amazon's response

What Happened at Amazon

In March 2026, multiple outages at Amazon's e-commerce service and AWS drew significant attention. The timeline below separates what can actually be confirmed from what is disputed:

| Period | Event | Confirmed Facts | Discrepancy Between Reports and Amazon's Position |
| --- | --- | --- | --- |
| December 2025 | AWS Cost Explorer extended outage | 13-hour service disruption in parts of the China region [4] | FT reported the change was made using the AI coding tool Kiro. Amazon countered that it was "user error, not AI error," specifically an access control misconfiguration [4] |
| March 5, 2026 | ~6-hour Amazon e-commerce outage | Login, pricing, and order completion disrupted. Amazon attributed it to "software code deployment" [6] | Amazon explicitly stated AWS was not involved [2] |
| March 10, 2026 | E-commerce engineering review meeting | Weekly operations review (TWiST) conducted a deep-dive review [2] | FT obtained a briefing note that initially mentioned "Gen-AI assisted changes"; CNBC later reported the GenAI references were removed from the note [2] |

MarketScreener reported four Sev-1 incidents (defined as major system outages or significant degradation) within a single week [1].


Amazon's Position

Before drawing conclusions, Amazon's explanation deserves direct attention.

In response to CNBC, Amazon's spokesperson stated: "There was only one AI-related incident, and there were no incidents caused by AI-written code." [2] The TWiST meeting was described as "part of our regular weekly operations review" and "an example of continuous improvement." [2]

On the AWS-side outages, Amazon stated these were unrelated to the retail incidents [2]. Regarding the December Kiro-related reports, Amazon maintained: "access control misconfigurations can happen with any development tool, not just AI tools" and "there is no convincing evidence that AI tools have increased incidents." [3]


What the Reporting Reveals

The FT's version of the briefing note, from before the GenAI references were removed, listed "high blast radius" and "Gen-AI assisted changes" as incident trends, with "new GenAI usage where best practices and safeguards are not yet sufficiently established" cited as a contributing factor [4].

SVP Dave Treadwell acknowledged in an internal email that site availability "has not been great recently" and noted that the use of GenAI tools to assist and accelerate production changes had led to "unsafe practices." [2]

Reconciling the reporting with Amazon's public position, the core issue is not "did AI cause the outages." The question is whether the mechanisms for safely controlling a higher volume and velocity of changes had matured sufficiently.


How Amazon Responded

Treadwell's memo made the two time horizons explicit:

"We will implement temporary safety practices that introduce controlled friction to changes in the most important parts of the Retail experience. In parallel, we are investing in more durable solutions including deterministic and agentic safeguards." [2]

Short-term: controlled friction to slow changes to high-impact systems. Long-term: mechanical guardrails for permanent protection. The approach is not "stop using AI" — it is "build the governance infrastructure."

FT-sourced reporting indicated enhanced senior review for AI-assisted changes. However, Amazon's spokesperson told CNBC that reports of "requiring senior approval for all AI-assisted changes" were inaccurate [2], so the reported scope of the measures differs from the official account. At minimum, additional review and documentation requirements for high-impact (Tier-1) systems appear to be included.


Five Areas to Audit

Dismissing this as "an AI tool problem" or "Amazon being sloppy" misses the point. Every organization that has accelerated deployment velocity with AI coding tools faces the same structural challenge.

Constellation Research analyst Chirag Mehta flagged the risk that adding human review would negate AI's speed gains, advocating instead for pre-deployment policy checks, blast radius controls, canary releases, automated rollback, and change traceability [5].

Info-Tech Research Group's Manish Jain framed the core issue this way: AI is not making more mistakes; it is operating at a scale where small errors produce a massive blast radius. Agentic AI compresses time-to-release, but governance has not kept up with that acceleration [5].

Five areas worth auditing, informed by the incident reports and Amazon's response:

Permission Design

Has your production access control been revisited with AI agents in mind?

Amazon's own account of the December 2025 incident describes "an engineer with broader-than-expected permissions" using an AI tool and bypassing the normal two-person approval requirement [4]. Regardless of where the root cause is attributed, "production change permissions were not appropriately scoped" remains an operational design gap.

AI agents typically inherit the operator's permissions. Broad permissions given to a human engineer flow through to the agent — without the implicit judgment a human would apply. You need agent-specific permission scopes and escalation triggers for high-risk operations.
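One way to picture an agent-specific scope with an escalation trigger is the minimal sketch below. It assumes a simple model in which the agent's effective permissions are the intersection of the operator's permissions and a separately declared agent scope. The action names, `AgentScope`, and `authorize` are hypothetical and do not describe Amazon's or any vendor's mechanism.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # pause and ask a human approver
    DENY = "deny"


# Hypothetical action names, for illustration only.
HIGH_RISK_ACTIONS = frozenset({"prod:deploy", "prod:delete", "iam:modify"})


@dataclass(frozen=True)
class AgentScope:
    """Actions an agent may attempt, declared separately from its operator's."""
    allowed_actions: frozenset


def authorize(action: str, operator_actions: frozenset, scope: AgentScope) -> Decision:
    # Effective permissions are the intersection: the agent never exceeds
    # the operator, and never exceeds its own declared scope.
    effective = operator_actions & scope.allowed_actions
    if action not in effective:
        return Decision.DENY
    # High-risk actions are never auto-approved, even when in scope.
    if action in HIGH_RISK_ACTIONS:
        return Decision.ESCALATE
    return Decision.ALLOW


operator = frozenset({"repo:read", "staging:deploy", "prod:deploy", "prod:delete"})
agent = AgentScope(frozenset({"repo:read", "staging:deploy", "prod:deploy"}))

for action in ("staging:deploy", "prod:deploy", "prod:delete"):
    print(action, authorize(action, operator, agent).value)
```

In this example the staging deploy is allowed, the production deploy is escalated to a human, and the production delete is denied because the agent scope omits it even though the operator holds that permission.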

Blast Radius Control

Does your system prevent a single failed deployment from taking down the entire service for hours?

The briefing note's repeated reference to "high blast radius" signals how critical this is [4]. Failures cannot be eliminated. The question is whether the blast radius of any given failure is bounded in advance.

Canary releases, feature flags that decouple deployment from enablement, and pre-defined blast radius caps for high-impact systems are the standard levers. In environments where AI increases change frequency, designing deployment units that "fail within acceptable bounds" becomes more important.
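As a rough illustration of a canary release combined with a pre-defined blast radius cap, the sketch below walks through traffic stages and stops at the cap or rolls back on a bad error rate. The function name, the `observe_error_rate` hook, and all thresholds are assumptions made for this example; real rollout tooling exposes its own interfaces.

```python
from typing import Callable, Sequence, Tuple


def run_canary(
    stages: Sequence[float],
    observe_error_rate: Callable[[float], float],
    max_error_rate: float,
    exposure_cap: float,
) -> Tuple[float, bool]:
    """Walk through traffic stages; return (final_exposure, healthy).

    stages: increasing fractions of traffic, e.g. [0.01, 0.05, 0.25, 1.0].
    observe_error_rate: hypothetical hook returning the error rate seen
        while the new version serves the given fraction of traffic.
    exposure_cap: the pre-defined blast radius cap for this system.
    """
    exposure = 0.0
    for stage in stages:
        if stage > exposure_cap:
            break  # going further needs a separate, explicit decision
        if observe_error_rate(stage) > max_error_rate:
            return 0.0, False  # unhealthy: route all traffic back to the old version
        exposure = stage
    return exposure, True


def failing_at_five_percent(fraction: float) -> float:
    # Simulated telemetry: the new version starts failing at 5% of traffic.
    return 0.08 if fraction >= 0.05 else 0.001


stages = [0.01, 0.05, 0.25, 1.0]
print(run_canary(stages, failing_at_five_percent, max_error_rate=0.01, exposure_cap=0.25))
print(run_canary(stages, lambda fraction: 0.001, max_error_rate=0.01, exposure_cap=0.25))
```

The first call rolls back to zero exposure once the 5% stage looks unhealthy. The second stays healthy but stops at 25% of traffic, because the cap bounds how far any single deployment can spread without a further decision.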

Change Traceability

Can you distinguish AI-assisted changes from human-authored changes in your audit trail?

The later removal of GenAI references from the briefing note [2] illustrates how difficult it is to accurately reconstruct change attribution after the fact. Three things are needed: metadata tagging for AI-assisted changes, observability infrastructure that traces the causal relationship between changes and production behavior, and durable, searchable audit logs.

Constellation Research calls for teams to always know "which changes were AI-assisted, who approved them, and what behavioral changes occurred in production." [5]
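Metadata tagging can be as lightweight as structured trailers in a commit message that are parsed into an audit record at merge time. The sketch below assumes that approach. The trailer keys (`AI-Assisted`, `AI-Tool`, `Approved-By`), the tool name, and the `ChangeRecord` shape are hypothetical conventions chosen for illustration, not an established standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical trailer keys; pick whatever your team standardizes on.
TRACKED_TRAILERS = ("AI-Assisted", "AI-Tool", "Approved-By")


@dataclass(frozen=True)
class ChangeRecord:
    change_id: str
    ai_assisted: bool
    ai_tool: Optional[str]
    approved_by: Tuple[str, ...]


def parse_change(change_id: str, commit_message: str) -> ChangeRecord:
    """Extract attribution trailers ("Key: value" lines) from a commit message."""
    trailers: Dict[str, List[str]] = {key: [] for key in TRACKED_TRAILERS}
    for line in commit_message.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in trailers:
            trailers[key.strip()].append(value.strip())

    ai_values = trailers["AI-Assisted"]
    tools = trailers["AI-Tool"]
    return ChangeRecord(
        change_id=change_id,
        # A missing trailer is treated as "not AI-assisted" in this sketch.
        ai_assisted=bool(ai_values) and ai_values[-1].lower() == "true",
        ai_tool=tools[-1] if tools else None,
        approved_by=tuple(trailers["Approved-By"]),
    )


message = """Tighten retry budget for checkout client

AI-Assisted: true
AI-Tool: example-assistant
Approved-By: alice
Approved-By: bob
"""
record = parse_change("change-001", message)
print(record)
```

Treating a missing trailer as "not AI-assisted" is a simplification. A stricter pipeline might reject changes that omit the trailer entirely, so that attribution never depends on memory after an incident.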

Review Architecture

Does your review process scale with the increase in AI-generated code volume?

FT-sourced reporting described strengthened senior review for AI-assisted changes. But human-only review does not scale as change volume increases. Treadwell's explicit separation of temporary (controlled friction) and durable (deterministic / agentic safeguards) measures reflects an awareness that human checks alone are not sustainable. Tiered review depth by risk level, combined with automated deterministic policy checks, is the architecture needed for the long term.
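Tiered review combined with deterministic checks can be sketched as a small routing function: automated policy checks gate everything, and human attention is reserved for the riskiest changes. The tiers, the 200-line threshold, and the field names below are illustrative assumptions, not rules reported for Amazon.

```python
from dataclasses import dataclass
from enum import Enum


class Review(Enum):
    BLOCKED = "blocked"            # automated checks failed; nothing to review yet
    AUTOMATED_ONLY = "automated"   # deterministic checks are sufficient
    PEER = "peer"
    SENIOR = "senior"


@dataclass(frozen=True)
class Change:
    service_tier: int              # 1 = most critical (hypothetical tiering)
    lines_changed: int
    touches_access_control: bool
    policy_checks_passed: bool     # lint, tests, config validation, etc.


def required_review(change: Change, large_change_lines: int = 200) -> Review:
    # Deterministic gate first: humans never review a change that fails policy.
    if not change.policy_checks_passed:
        return Review.BLOCKED
    # Highest-impact changes get the deepest review.
    if change.service_tier == 1 or change.touches_access_control:
        return Review.SENIOR
    if change.service_tier == 2 or change.lines_changed > large_change_lines:
        return Review.PEER
    return Review.AUTOMATED_ONLY


print(required_review(Change(3, 40, False, True)).value)
print(required_review(Change(1, 5, False, True)).value)
```

Because the routing is a pure function of change metadata, it can run in CI on every change, and the thresholds can be tuned as evidence accumulates without redesigning the process.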

Rollback Capability

Has your rollback mechanism been tested in production — not just documented in a runbook?

The six-hour duration of the March 5 outage suggests rollback did not execute immediately. AI-assisted changes often span multiple files and services simultaneously, making rollback granularity design more complex. Smaller, independently reversible deployment units reduce this complexity.
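The idea of small, independently reversible units, together with a rollback that is exercised rather than merely documented, can be sketched as follows. This is a toy in-memory model under stated assumptions: `DeploymentUnit`, `rollback_drill`, and the version labels are hypothetical, and a real drill would run against actual deployment tooling.

```python
from typing import Callable, Dict, List


class DeploymentUnit:
    """One independently reversible unit with its own version history."""

    def __init__(self, name: str, initial_version: str) -> None:
        self.name = name
        self._history: List[str] = [initial_version]

    @property
    def current(self) -> str:
        return self._history[-1]

    def deploy(self, version: str) -> None:
        self._history.append(version)

    def rollback(self) -> str:
        if len(self._history) < 2:
            raise RuntimeError(f"{self.name}: no earlier version to roll back to")
        self._history.pop()
        return self.current


def rollback_drill(unit: DeploymentUnit, candidate: str,
                   is_healthy: Callable[[DeploymentUnit], bool]) -> bool:
    """Deploy a candidate, roll it back, and confirm the unit recovered."""
    known_good = unit.current
    unit.deploy(candidate)
    restored = unit.rollback()
    return restored == known_good and is_healthy(unit)


units: Dict[str, DeploymentUnit] = {
    "pricing": DeploymentUnit("pricing", "v41"),
    "login": DeploymentUnit("login", "v17"),
}
units["pricing"].deploy("v42")
units["login"].deploy("v18")
units["pricing"].rollback()  # only pricing reverts; login stays on v18
print(units["pricing"].current, units["login"].current)
```

Keeping a separate history per unit is what makes the revert of `pricing` leave `login` untouched. A change that spans both would need to be split, or rolled back as an explicitly coordinated pair.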


How to Read This Incident

The most important signal from Amazon's response is the direction: not "stop AI" but "build governance infrastructure." The explicit separation of controlled friction as a temporary measure and deterministic / agentic safeguards as a durable solution represents a direct acknowledgment of the gap between AI deployment velocity and operational governance maturity.

The practical risk to watch is "temporary measures becoming permanent." Adding human review is rational as a buffer while mechanical guardrails are being built — but without a defined exit condition, it quietly becomes the standard process. If you add a temporary measure, define its exit condition at the same time.
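One lightweight way to keep temporary measures from quietly becoming permanent is to record each one with its exit condition and a review date, then flag any that lack either or are overdue. The sketch below assumes exactly that. The measure names, exit conditions, and dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class TemporaryMeasure:
    name: str
    exit_condition: Optional[str]   # what must be true before it is removed
    review_by: Optional[date]       # date by which it is re-evaluated


def needs_attention(measures: List[TemporaryMeasure], today: date) -> List[str]:
    flagged = []
    for m in measures:
        if not m.exit_condition or m.review_by is None:
            flagged.append(f"{m.name}: no exit condition or review date")
        elif m.review_by < today:
            flagged.append(f"{m.name}: review overdue since {m.review_by.isoformat()}")
    return flagged


# Hypothetical register of temporary measures.
measures = [
    TemporaryMeasure("extra review for tier-1 changes",
                     "automated policy checks cover tier-1 deploys",
                     date(2026, 6, 1)),
    TemporaryMeasure("manual sign-off for config pushes", None, None),
]
for line in needs_attention(measures, today=date(2026, 7, 1)):
    print(line)
```

Run on a schedule, a check like this turns "define the exit condition" from a good intention into something that shows up in a report when it is missing.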

The five areas — permission design, blast radius control, change traceability, review architecture, and rollback capability — are not AI-specific problems. But in environments where AI accelerates the speed and volume of changes, gaps in any of these areas surface faster and at greater scale. That an organization with Amazon's scale and operational maturity encountered this is sufficient reason for any organization adopting the same tools to audit its own governance design.



  1. Based on MarketScreener reporting (March 11, 2026). Four Sev-1 incidents reportedly occurred within one week. 

  2. Based on CNBC reporting (March 10, 2026). Amazon's spokesperson stated there was one AI-related incident and no incidents caused by AI-written code. The TWiST meeting was described as a regular weekly operations review. Treadwell's memo referenced "temporary safety practices," "controlled friction," and "deterministic and agentic safeguards." Amazon explicitly stated the AWS outages were unrelated to the retail incidents. CNBC also reported the removal of GenAI references from the briefing note. 

  3. Based on The Register reporting (March 10, 2026). Amazon stated that access control misconfigurations can occur with any development tool and that there is no convincing evidence AI tools have increased incidents. 

  4. Based on Financial Times reporting (March 10, 2026, and February 2026 AWS-related reports). The briefing note reportedly listed "trend of incidents," "high blast radius," and "Gen-AI assisted changes." Per CNBC reporting, GenAI references were subsequently removed from the note. Regarding the December 2025 AWS outage, Amazon disputed FT's account, describing it as "user error, not AI error — an access control misconfiguration." 

  5. From CIO / InfoWorld reporting (March 11, 2026). Comments from Chirag Mehta, Constellation Research, and Manish Jain, Info-Tech Research Group. 

  6. March 5, 2026 e-commerce outage. Reuters, CNBC, and other outlets reported the cause as "software code deployment," with disruptions to login, pricing, and order completion lasting approximately six hours.