
What Is Harness Engineering—What Do You Actually Build?

For / Key Points

For: Developers who have encountered the term "harness engineering" but find its definition unclear, and who want to know: "So what do I actually need to do in my codebase?"

Key Points:

  • A harness isn't "just AI review" or "just linting." It's the design of how you combine constraints, information, verification, recovery, and review to keep AI working reliably
  • The primary sources treat both LLM-based review layers and sub-agents as components of the harness
  • The more layers you stack, the greater your reproducibility, controllability, and verifiability. This article provides a framework for deciding how far to go in your own environment

Why Harness Engineering Looks Ambiguous

Over the past few months, "harness engineering" has spread rapidly as a term [1][2][3]. At the same time, different people use it to mean very different things. Some call writing AGENTS.md a harness. Others use the word for having AI review AI. Still others mean making lint / test / contract checks mandatory gates.

It looks ambiguous because all of these are partially correct—they're just referring to different layers. A harness isn't a single tool or technique; it's a design composed of multiple layers, and people differ on which layer they're talking about. This article returns to primary sources, decomposes the layers, and maps them to what you actually need to build in practice.

This site has previously published an overview article and a review loop implementation guide, but as the discussion has evolved, an updated framing was needed. This article is that update.


The Core Definition—What Did the Originators Actually Say?

Mitchell Hashimoto's approach: "When an agent makes a mistake, build mechanisms so that mistake never happens again" [1]. He cites both AGENTS.md updates (implicit prompting) and dedicated tooling (programmed tools) as means, so the idea is not limited to deterministic tools.

The OpenAI Codex team reports that they have Codex perform self-review and additional agent reviews, embedding review, validation, feedback handling, and recovery into the system [2]. They also employ custom linters, structural tests, and architectural constraints that mechanically enforce taste invariants.

Birgitta Böckeler analyzed OpenAI's harness architecture and explicitly noted the mix of "deterministic and LLM-based approaches" [3]. She also pointed out that functional and behavioral verification is relatively weak, which underscores the importance of the verification layer.

The common thread: a harness is every piece of system design, external to the model, that keeps the model working reliably and effectively. This includes linters, AGENTS.md, and AI review layers alike. LangChain's framing of "Agent = Model + Harness" [4] aligns with this direction.


How Leading Practitioners Are Expanding the Harness

Beyond the core definitions from primary sources, leading practitioners are pushing harness implementations further.

Anthropic's latest engineering blog describes a three-agent architecture (planner, generator, evaluator) as "harness design" [5]. The evaluator is an LLM, but it applies hard thresholds against predefined criteria and uses Playwright MCP to interact with the actual application for scoring.

Notably, they report that in its initial state, the evaluator had a tendency to "approve despite finding issues," and it took tuning to achieve stable quality judgments [5]. The LLM review layer is a harness component, but its reliability changes dramatically when combined with verification mechanisms.

HumanLayer identifies six harness levers: system prompt, tools / MCPs, context, sub-agents, hooks, and skills [6]. Sub-agents are probabilistic LLMs, yet they are positioned as legitimate harness levers.


What Do You Actually Build—The Five Layers

Building on the primary source discussion, this section decomposes harness components into five practical layers.

┌─────────────────────────────────────────────────────────────┐
│  ① Constraint Layer                                         │
│                                                             │
│   Prohibitions, permissions, allowed file/directory scope   │
│   Network restrictions, sandbox configuration               │
│   → Structurally prevent what the agent "must not do"       │
├─────────────────────────────────────────────────────────────┤
│  ② Information Layer                                        │
│                                                             │
│   AGENTS.md / CLAUDE.md, design policies, ADRs, naming     │
│   conventions, reference specs, coding guidelines           │
│   → Tell the agent "what it should do"                      │
├─────────────────────────────────────────────────────────────┤
│  ③ Verification Layer                                       │
│                                                             │
│   lint / typecheck / test / contract / policy check         │
│   Runtime validation via Playwright, etc.                   │
│   → Same input always returns the same verdict.             │
│     Reproducible pass/fail criteria                         │
├─────────────────────────────────────────────────────────────┤
│  ④ Recovery Layer                                           │
│                                                             │
│   Rollback, retry, scoped diffs, failure abort              │
│   Loop iteration caps, human escalation                     │
│   → Mechanisms to safely return from failure states         │
├─────────────────────────────────────────────────────────────┤
│  ⑤ Review Layer                                             │
│                                                             │
│   Review by a separate AI from the executor / human review  │
│   Severity classification (P0/P1/P2), triage               │
│   → Eliminate executor bias, catch what was missed          │
└─────────────────────────────────────────────────────────────┘

Any layer can be a harness component. But the more layers you combine, the stronger the harness becomes.
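As a rough illustration, the diagram can be mirrored as a checklist that reports which layers a setup does not yet cover. The layer names follow the diagram above; the `missing_layers` helper and the string identifiers are assumptions made for this sketch, not part of any cited tool.

```python
# Layer names follow the five-layer diagram; the identifiers are illustrative.
LAYERS = ("constraint", "information", "verification", "recovery", "review")


def missing_layers(in_place: set[str]) -> list[str]:
    """Return the layers not yet covered, in the diagram's order."""
    unknown = in_place - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layer names: {sorted(unknown)}")
    return [layer for layer in LAYERS if layer not in in_place]


# Hypothetical setup that only has AGENTS.md and a loop cap.
print(missing_layers({"information", "recovery"}))
# prints ['constraint', 'verification', 'review']
```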


Each Layer's Role and Limits

Layer 1: Constraints. Read-only sandboxes, file edit scope restrictions, network isolation. Structurally makes it impossible for the agent to do what it must not do. The strongest enforcement mechanism—but it can't communicate "what to do."
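One small piece of a constraint layer is an edit-scope check that the harness runs before applying an agent's file write. The sketch below is a minimal version under stated assumptions: `is_edit_allowed`, the `/repo` root, and the allowed directory names are all hypothetical, and a check like this complements an OS-level sandbox rather than replacing it.

```python
from pathlib import Path


def is_edit_allowed(target: str, repo_root: str, allowed_dirs: list[str]) -> bool:
    """Return True only if `target` resolves to a path inside an allowed directory.

    Resolving first means "src/../secrets.env" or an absolute path cannot
    slip past a naive string-prefix comparison.
    """
    root = Path(repo_root).resolve()
    resolved = (root / target).resolve()
    return any(
        resolved.is_relative_to((root / allowed).resolve())
        for allowed in allowed_dirs
    )


# Hypothetical repository root and scope.
print(is_edit_allowed("src/app.py", "/repo", ["src", "tests"]))
print(is_edit_allowed("src/../secrets.env", "/repo", ["src", "tests"]))
```

The first call is inside the scope and the second escapes it through `..`, so the harness would reject the second write before it ever touches the disk.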

Layer 2: Information. AGENTS.md, design policies, coding guidelines. Provides context to the agent. As Hashimoto recommends, updating these every time the agent makes a mistake gradually codifies tacit knowledge [1]. However, the agent follows this guidance "almost always," not "without exception every time," so information alone cannot guarantee compliance.
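The habit of updating AGENTS.md after each mistake can be made cheap with a small helper. This is only a sketch: `record_lesson` is a hypothetical function, and appending one bullet per rule at the end of the file is an assumed convention, since AGENTS.md is free-form Markdown.

```python
from pathlib import Path


def record_lesson(agents_md: Path, rule: str) -> bool:
    """Append `rule` as a bullet to AGENTS.md; return False if it is already there."""
    bullet = f"- {rule.strip()}"
    existing = agents_md.read_text(encoding="utf-8") if agents_md.exists() else ""
    if bullet in (line.strip() for line in existing.splitlines()):
        return False  # already codified, avoid duplicate rules
    body = existing.rstrip("\n")
    updated = (body + "\n" if body else "") + bullet + "\n"
    agents_md.write_text(updated, encoding="utf-8")
    return True
```

Calling `record_lesson(Path("AGENTS.md"), "Never edit generated files under dist/")` after an incident would add the rule once and leave the file unchanged on later calls.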

Layer 3: Verification. Lint, test, typecheck, contract. Returns the same verdict for the same input every time. Results don't change when you swap models. Provides reproducibility of pass/fail judgments. Böckeler's observation that "functional and behavioral verification is weak" [3] points to gaps in this layer.
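A deterministic gate can be as simple as running each check as a subprocess and requiring every exit code to be zero. In the sketch below, `run_check` and `gate` are hypothetical names, and the commands are stand-ins that call the Python interpreter; a real project would substitute its own lint, typecheck, and test commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool


def run_check(name: str, command: list[str]) -> CheckResult:
    """Run one check; exit code 0 means pass, anything else means fail."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name=name, passed=completed.returncode == 0)


def gate(results: list[CheckResult]) -> bool:
    """Pass only if at least one check ran and all of them passed."""
    return bool(results) and all(result.passed for result in results)


# Stand-in commands (assumption): replace with the project's real checks.
results = [
    run_check("lint-stand-in", [sys.executable, "-c", "raise SystemExit(0)"]),
    run_check("test-stand-in", [sys.executable, "-c", "raise SystemExit(1)"]),
]
print(gate(results))
# prints False
```

Treating an empty result list as a failure is a deliberate choice in this sketch: a gate that ran no checks has produced no evidence.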

Layer 4: Recovery. Loop iteration caps, rollback on failure, human escalation. Mechanisms to safely return the agent from unexpected states. Without this, pendulum problems (A to B to A fix loops) and infinite loops can't be stopped.
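The iteration cap and the pendulum problem can both be handled in the loop driver itself. The following is a minimal sketch, assuming the agent's work product can be summarized as a string (for example a diff); `run_fix_loop`, `Outcome`, and the callback signatures are hypothetical.

```python
import hashlib
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"
    ESCALATE_REPEATED_STATE = "escalate: repeated state"
    ESCALATE_ITERATION_CAP = "escalate: iteration cap"


def _fingerprint(state: str) -> str:
    return hashlib.sha256(state.encode("utf-8")).hexdigest()


def run_fix_loop(
    initial: str,
    attempt_fix: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_iterations: int = 5,
) -> tuple[Outcome, str]:
    """Drive a fix loop; stop on success, on a revisited state, or at the cap."""
    state = initial
    seen = {_fingerprint(state)}
    for _ in range(max_iterations):
        if is_done(state):
            return Outcome.DONE, state
        state = attempt_fix(state)
        fingerprint = _fingerprint(state)
        if fingerprint in seen:
            # Covers A -> B -> A pendulums and no-op fixes alike.
            return Outcome.ESCALATE_REPEATED_STATE, state
        seen.add(fingerprint)
    if is_done(state):
        return Outcome.DONE, state
    return Outcome.ESCALATE_ITERATION_CAP, state
```

Both escalation outcomes are meant to hand control to a human rather than retry, which is what keeps the loop from running indefinitely.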

Layer 5: Review. Review by a separate AI or a human, distinct from the executor. Eliminates executor bias and adds a filter before reaching human eyes. Adding severity classification (P0: correctness / security, P1: maintainability, P2: style) gives structure to the probabilistic variance of review.
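Severity classification can be represented directly in the data the reviewer returns. The sketch below uses the P0/P1/P2 meanings given above; the `Finding` type, the `triage` function, and the policy that only P0 findings block a change are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    P0 = 0  # correctness / security
    P1 = 1  # maintainability
    P2 = 2  # style


@dataclass(frozen=True)
class Finding:
    severity: Severity
    message: str


def triage(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (blocking, advisory), most severe first.

    Assumed policy: P0 blocks the change, P1 and P2 are advisory.
    """
    ordered = sorted(findings, key=lambda finding: finding.severity)
    blocking = [f for f in ordered if f.severity == Severity.P0]
    advisory = [f for f in ordered if f.severity != Severity.P0]
    return blocking, advisory
```

Because the reviewer's output is forced into a fixed set of severities, run-to-run variation in wording matters less than whether a P0 was raised.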

Here's a useful diagnostic: if you swap the review model for a different one, does the final pass/fail verdict stay roughly the same? If yes, Layer 3 verification is doing its job. If no, there's room to strengthen Layers 3 and 4.
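The swap test can be turned into a number by running the same set of changes through the pipeline with two different review models and comparing final verdicts. This sketch assumes verdicts are recorded as a mapping from a change identifier to pass/fail; `verdict_agreement` and the identifiers are hypothetical.

```python
def verdict_agreement(verdicts_a: dict[str, bool], verdicts_b: dict[str, bool]) -> float:
    """Fraction of shared changes on which two pipeline runs reached the same verdict."""
    shared = verdicts_a.keys() & verdicts_b.keys()
    if not shared:
        raise ValueError("no changes in common to compare")
    agreeing = sum(verdicts_a[key] == verdicts_b[key] for key in shared)
    return agreeing / len(shared)


# Hypothetical final verdicts with review model A and review model B.
with_model_a = {"pr-1": True, "pr-2": False, "pr-3": True, "pr-4": True}
with_model_b = {"pr-1": True, "pr-2": False, "pr-3": False, "pr-4": True}
print(verdict_agreement(with_model_a, with_model_b))
# prints 0.75
```

What counts as "roughly the same" is a team decision; the useful signal is whether the number moves toward 1.0 as more of the verdict shifts onto deterministic checks.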

The five-layer decomposition also works as a diagnostic frame for incident analysis. When an agent-caused problem occurs, identify which layer broke. "Information was provided but verification was missing." "Verification was in place but recovery wasn't." Identify the broken layer, and what to build next becomes obvious.
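If each incident is tagged with the layer that broke, a simple tally shows where to invest next. This is a sketch under assumptions: the tagging scheme and the `rank_broken_layers` helper are illustrative, and the incident list is made-up sample data.

```python
from collections import Counter

LAYERS = {"constraint", "information", "verification", "recovery", "review"}


def rank_broken_layers(incident_layers: list[str]) -> list[tuple[str, int]]:
    """Count incidents per broken layer, most frequent first."""
    unknown = set(incident_layers) - LAYERS
    if unknown:
        raise ValueError(f"unknown layer names: {sorted(unknown)}")
    return Counter(incident_layers).most_common()


# Made-up incident log: one entry per incident, naming the layer that failed.
incidents = [
    "verification", "recovery", "verification",
    "information", "verification", "recovery",
]
print(rank_broken_layers(incidents))
# prints [('verification', 3), ('recovery', 2), ('information', 1)]
```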


Configuration Examples by Context

With each layer's characteristics now visible, the question becomes how to combine them based on your team's scale and constraints.

Minimal Setup for Solo Development—Start with Layers 2 + 4

Write your project policies and prohibited patterns in AGENTS.md (Layer 2: information). When the agent makes mistakes, update it each time. If running loops, set an iteration cap (Layer 4: recovery).

Even this alone produces a visible improvement in agent output quality. The cost is near zero. If you already have lint and tests in your CI/CD pipeline, Layer 3 verification is naturally included.

Trigger to add the next layer: When multiple people start touching the agent's output, add Layer 1 (constraints) and Layer 3 (verification) to transition to a team setup.

Standard Team Setup—Layers 1 + 2 + 3 + 4

Restrict the edit scope with a sandbox (Layer 1: constraints). Write coding standards and prohibited patterns in AGENTS.md, shared across the team (Layer 2: information). Make lint / test / typecheck mandatory gates (Layer 3: verification). Set loop caps and escalate to humans when exceeded (Layer 4: recovery).

Introduce Layer 5 review when human review load becomes a bottleneck. The OpenAI Codex team ran agent-to-agent review as a near-replacement for human review [2]. However, they combined it with custom linters and structural tests: the review layer didn't stand alone but was paired with Layer 3 verification.

Trigger to add the next layer: When agent output directly affects customers or production, add Layer 5 (review) and automate Layer 4 (recovery) to transition to a production setup.

Strong Setup for Production Changes—All Five Layers

Specs, prohibitions, and acceptance criteria are externalized upfront (Layers 2 + 3: information + verification). Agent execution is confined to a sandbox (Layer 1: constraints). Lint / test / contract form the core of pass/fail judgments, with AI review limited to surfacing discussion points (Layer 5: review connected to Layer 3: verification). Below-threshold results trigger automatic rollback or require human approval (Layer 4: recovery). The entire process is recordable, reproducible, and auditable.
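The decision rule in this setup can be sketched as a function in which deterministic checks alone determine the verdict and AI review notes are carried along as discussion points. The names `decide`, `Decision`, and `Verdict`, and the rule that a run with no checks goes to a human, are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    ROLLBACK = "rollback"
    HUMAN_APPROVAL = "human approval"


@dataclass(frozen=True)
class Verdict:
    decision: Decision
    failed_checks: tuple[str, ...]
    discussion_points: tuple[str, ...]


def decide(
    check_results: dict[str, bool],
    review_notes: list[str],
    can_auto_rollback: bool,
) -> Verdict:
    """Derive the verdict from checks only; review notes never change it."""
    failed = tuple(sorted(name for name, ok in check_results.items() if not ok))
    if not check_results:
        decision = Decision.HUMAN_APPROVAL  # no evidence, so a human decides
    elif not failed:
        decision = Decision.ACCEPT
    elif can_auto_rollback:
        decision = Decision.ROLLBACK
    else:
        decision = Decision.HUMAN_APPROVAL
    return Verdict(decision, failed, tuple(review_notes))
```

Returning the failed checks and the review notes together in one record also gives an audit trail of why each change was accepted or rejected.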

Anthropic's latest harness design [5] closely resembles this configuration. The evaluator holds hard thresholds against predefined criteria, validates through Playwright MCP interaction, and marks the sprint as failed when criteria aren't met.

The SKILL.md approach introduced in this site's review loop implementation guide includes Layer 1 and Layer 4 elements such as read-only sandboxes and loop caps, but Layer 3 verification isn't at the center of quality judgments. It sits between the "standard team setup" and the "strong production setup."


Three Dimensions for Measuring Harness Strength

To gauge how strong your harness is, use these three dimensions.

Reproducibility. Given the same input, do you get similar results? The more thorough Layer 3 verification is, the higher the reproducibility. If swapping the review model doesn't significantly change pass/fail outcomes, reproducibility is high.

Controllability. Are your stop mechanisms, restrictions, and permissions well-defined? Layers 1 (constraints) and 4 (recovery) handle this. The benchmark: can you answer "how do you stop the agent when it goes off the rails?"

Verifiability. Can you judge quality outside the model? Layer 3 verification handles this. The more central lint / test / contract are to pass/fail judgments, the higher the verifiability.


Conclusion

Harness engineering isn't "just AI review" or "just linting." It's the design of how you combine constraints, information, verification, recovery, and review to keep AI working reliably.

You can start from any layer. But the more layers you stack, the greater your reproducibility, controllability, and verifiability. Map your current setup against the five layers to see what's missing—and the next thing to build becomes clear.

Related Articles

This article is part of the harness engineering concept series.



  1. Mitchell Hashimoto, "My AI Adoption Journey," mitchellh.com, February 5, 2026. https://mitchellh.com/writing/my-ai-adoption-journey 

  2. Ryan Lopopolo, "Harness engineering: leveraging Codex in an agent-first world," OpenAI, February 11, 2026. https://openai.com/index/harness-engineering/ 

  3. Birgitta Böckeler, "Harness Engineering," martinfowler.com, February 17, 2026. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html 

  4. LangChain, "The Anatomy of an Agent Harness," March 2026. https://blog.langchain.com/the-anatomy-of-an-agent-harness/ 

  5. Prithvi Rajasekaran, "Harness design for long-running application development," Anthropic Engineering, March 24, 2026. https://www.anthropic.com/engineering/harness-design-long-running-apps 

  6. HumanLayer, "Skill Issue: Harness Engineering for Coding Agents," March 2026. https://www.humanlayer.dev/blog/skill-issue-harness-engineering-for-coding-agents