Skip to content

The Current State of Harness Engineering Terminology — A Map as of April 2026

For:

Developers and tech leads already using coding agents such as Claude Code, Codex, and Cursor who need a clearer map of the changing meaning of "harness engineering." This is a companion to the March five-layer model article.

Key Points

Question this section answers: What map should readers keep in mind before the details?

  • As of April 2026, the definition of "harness" is split across vendors, but the split becomes manageable when organized by who builds it
  • For the user side, the basic structure is a 2x2 of guides/sensors and computational/inferential controls
  • The earlier five-layer model (constraints/information/verification/recovery/review) is orthogonal to the two-layer and 2x2 models, so they can be used together

The question for this article is simple. When the same word points to different systems, which map helps a working team decide what to build?

Why Harness Is Hard to Pin Down

Question this section answers: Why do different people mean different things by the same term?

"Harness" originally refers to tack for controlling and directing a working animal. In AI, the term appeared at least in the 2021 EleutherAI lm-evaluation-harness, where it named an evaluation framework for running language models under controlled conditions1. By 2024, the word was also used in IDE evaluation, such as the Copilot Evaluation Harness2.

In early 2026, Mitchell Hashimoto named the fifth stage of his AI adoption journey "Engineer the Harness"3. A few days later, OpenAI published its Codex report on building a large software product with agent-written code, and the phrase entered mainstream developer discussion4.

Terminology drift is normal when a field is being actively explored. The hard part is speed. LangChain, Anthropic, OpenAI, Thoughtworks, and individual builders all started using "harness" from their own product positions.

The core confusion is structural. Vendor-side infrastructure and user-side adaptation around a specific codebase are being discussed with the same word.

The Most Useful Split: Who Builds It

Question this section answers: What is the lowest-effort axis that cuts through the confusion?

The first useful axis is "who builds the harness." Birgitta Böckeler's Martin Fowler article shows a concentric model: model, builder harness, and user harness5. That split resolves most of the ambiguity.

Three concentric layers: model, builder harness, and user harnessThe model sits in the center. The builder harness around it is shipped by the agent vendor. The user harness is built by the coding-agent user around a specific codebase.User HarnessAGENTS.md / Skills / Hooks / MCP / CIBuilder Harnessagent loop / tools / memoryModelLLMRaw model -> builder harness makes it an agent -> user harness adapts it to the codebase

The builder harness is the foundation shipped by Anthropic, OpenAI, Cursor, or another agent provider. It includes system prompts, tool definitions, the agent loop, compaction, memory management, sandboxing, and similar machinery.

LangChain's Harrison Chase described the leaked Claude Code codebase as "512k lines"6. Compared with OpenAI's roughly 100-line AGENTS.md map4, that is more than three orders of magnitude larger. Whether every line belongs to the harness is less important than the scale signal: the builder-side layer can be large.

The user harness is what a team builds outside the vendor's agent for its own codebase. AGENTS.md, Skills, Hooks, MCP servers, lint/test loops, review agents, design docs, and CI quality gates belong here.

For most developers, this outer layer is the real work. They do not need to write Claude Code or Codex itself. They need to decide how their own codebase exposes rules, tests, recovery paths, and review loops to the agent.

Why Definitions Differ

Question this section answers: Why do companies and individuals define the term differently, and can that confusion be resolved?

Most of the disagreement follows the incentives of the speaker. Agent-platform companies define harness broadly. Agent builders emphasize the control loop. User-facing guides focus on constraints, sandboxes, repository knowledge, and validation.

PlayerPositionKey phraseMain concern
LangChainAgent platformModel + HarnessState, tools, runtime
AnthropicAgent builderControl loopLong-running work
OpenAICodex practitionerEnvironment and constraintsRepo knowledge, validation
Birgitta BöckelerConsulting lensGuides and sensorsUser-side control
Phil SchmidTeaching analogyHarness = OSLong-task infrastructure
Ashpreet BediSystems designRegular softwareWhole-system operation

Vivek Trivedy at LangChain defines the harness broadly: "Agent = Model + Harness"7. That makes sense for a platform whose users are building agents from components.

Anthropic's harness articles focus on the loop that invokes the model, manages tools, carries state across sessions, and evaluates work89. OpenAI's Codex report highlights a short AGENTS.md, structured docs/, custom linters, structural tests, and isolated local environments4.

Böckeler describes the user's outer harness through cybernetic controls5. Phil Schmid uses the operating-system analogy: model as CPU, context window as RAM, harness as OS10. Bedi frames agentic software as regular software with agents in the business-logic layer, requiring coordinated design across agent, data, security, interface, and infrastructure layers11.

None of these definitions is simply wrong. They come from different positions. The useful move is to first decide whether the reader is an agent vendor or a coding-agent user.

Another Axis for User Harnesses

Question this section answers: How is the user harness structured internally?

The user-side harness can be split one more time. Böckeler's model gives a horizontal axis of guides/sensors and a vertical axis of computational/inferential controls5. Guides act before the agent acts. Sensors observe after action and help the agent self-correct.

2x2
Guide
before action
Sensor
after action
Computational
deterministic
LSP
CLI
codemod
linter
typecheck
ArchUnit / tests
Inferential
LLM judgment
AGENTS.md
Skills
architecture docs
AI code review
LLM-as-judge
architecture review agent

Maintaining only CLAUDE.md or Skills fills the inferential guide quadrant. That is useful, but it does not detect failure. Without computational guides and sensors, the system keeps leaning on LLM judgment.

Still, the goal is not to fill every quadrant for its own sake. If architecture drift is the bottleneck, start with computational sensors. If misunderstandings of local conventions dominate, improve inferential guides. If AI review is expensive, push more checks into linters and tests first.

Mapping to the Earlier Five-Layer Model

Question this section answers: How does the March five-layer model relate to the two-layer and 2x2 models?

The March five-layer model split harness engineering into constraints, information, verification, recovery, and review12. That axis asks what is being controlled. The guide/sensor axis asks when control happens.

LayerGuide (before)Sensor (after)
ConstraintsSandbox boundary
permissions and edit scope
Almost none
the structure rejects it up front
InformationAGENTS.md / Skills
design docs
Context rot monitoring
stale-doc detection
VerificationTypes / contracts / schemas
acceptance criteria
lint / test / ArchUnit
CI gates
Recoveryself-repair prompts
rollback policy
LLM-optimized errors
failure-log summaries
Reviewreview criteria
severity taxonomy
AI review agent
LLM-as-judge

This table is the key contribution of the article. The five layers describe what is regulated; the 2x2 describes when and how regulation happens. They are not competing frameworks. They are orthogonal.

One more pattern appears when reading the rows top to bottom. Constraints and information tend to be more deterministic. Recovery and review are usually more probabilistic and more expensive. Before adding more lower-layer judgment, check whether an upper layer can remove the failure structurally.

Boundaries Move as a Gradient

Question this section answers: How does the builder/user split blur in practice?

The two-layer model is useful, but the boundary is not fixed. The Claude Agent SDK exposes the tools, agent loop, and context management that power Claude Code as a programmable library13. The Codex SDK similarly brings the Codex agent implementation into custom engineering workflows and apps14.

A gradient from user harnesses to builder harnessesThe left side shows using Claude Code as-is, the middle shows custom orchestration with an Agent SDK, and the right side shows building the harness itself.Use Claude Code as-isAGENTS.md / HooksCustom Agent SDK flowGenerator / Evaluator splitBuild the harnessloop / tools / memorymostly user-sidemiddle layermostly builder-sideThe more deterministic steps and agent freedom alternate, the more the work moves toward the middle

This middle layer includes generator/evaluator separation, pipelines that alternate deterministic steps with agent freedom, and custom UIs that connect approvals or session management to an SDK.

Garry Tan's "Thin Harness, Fat Skills" is useful here15. The idea is to keep the harness thin, push judgment into skills, and push execution down into deterministic tools. The harness should not absorb everything.

As SDKs become more common, the boundary will keep moving. Users will borrow builder-side parts while assembling more codebase-specific orchestration around them.

How Much Should Teams Care?

Question this section answers: What should a team do tomorrow?

The practical conclusion is that teams do not need to care too much about the terminology. They need to identify which map matches their bottleneck.

Three questions are enough:

  • Which quadrant of the 2x2 is thin?
  • Which of the five layers can the team afford to strengthen?
  • Does the team really need to build a builder harness?

For most teams, the answer to the third question is no. Use Claude Code or Codex as shipped, then add AGENTS.md, Skills, Hooks, MCP, lint/test, and review loops around the codebase. A minimal starting point is this small.

AGENTS.md
- Read the relevant design docs before editing
- Run `npm test` and `npm run lint` after changing code
- If checks fail, leave the cause and next step briefly

The common trap is stopping at "we wrote CLAUDE.md, therefore we did harness engineering." That fills an inferential guide, but it often leaves out computational sensors, verification loops, and measurement. A system that does not detect failure does not learn from it.

Terminology is only a map. Once the map is in hand, return to the terrain of the team.



  1. Leo Gao et al., "A framework for few-shot language model evaluation", Zenodo, September 2, 2021 https://zenodo.org/records/5371629 

  2. Anisha Agarwal et al., "Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming", arXiv, February 22, 2024 https://arxiv.org/abs/2402.14261 

  3. Mitchell Hashimoto, "My AI Adoption Journey" https://mitchellh.com/writing/my-ai-adoption-journey 

  4. Ryan Lopopolo, "Harness engineering: leveraging Codex in an agent-first world", OpenAI, February 11, 2026 https://openai.com/index/harness-engineering/ 

  5. Birgitta Böckeler, "Harness engineering for coding agent users", Martin Fowler, updated April 2, 2026 https://martinfowler.com/articles/harness-engineering.html 

  6. Harrison Chase, "Your harness, your memory", LangChain, April 11, 2026 https://www.langchain.com/blog/your-harness-your-memory 

  7. Vivek Trivedy, "The Anatomy of an Agent Harness", LangChain, March 10, 2026 https://www.langchain.com/blog/the-anatomy-of-an-agent-harness 

  8. Justin Young, "Effective harnesses for long-running agents", Anthropic, November 26, 2025 https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents 

  9. Prithvi Rajasekaran, "Harness design for long-running application development", Anthropic, March 24, 2026 https://www.anthropic.com/engineering/harness-design-long-running-apps 

  10. Philipp Schmid, "The importance of Agent Harness in 2026", January 5, 2026 https://www.philschmid.de/agent-harness-2026 

  11. Ashpreet Bedi, "Systems Engineering", April 14, 2026 https://www.ashpreetbedi.com/articles/systems-engineering 

  12. SmartScope, "What Is Harness Engineering: What Do You Actually Build?", March 2026 https://smartscope.blog/en/blog/harness-engineering-what-to-build/ 

  13. Anthropic, "Agent SDK overview", Claude Code Docs https://code.claude.com/docs/en/agent-sdk/overview 

  14. OpenAI, "Codex is now generally available", October 6, 2025 https://openai.com/index/codex-now-generally-available/ 

  15. Garry Tan, "Thin Harness, Fat Skills", GitHub, created April 9 and updated April 11, 2026 https://github.com/garrytan/gbrain/blob/master/docs/ethos/THIN_HARNESS_FAT_SKILLS.md