Rethinking Harness Engineering through Core and Shell
A loop of inspection, exploration, evidence, and removal that turns generation into a trustworthy system
For / Key Points
For: Readers who design or operate AI coding harnesses: rules, hooks, verification, permissions, review flows, and related controls. The article is easiest to follow if you have used agents such as Claude Code or Codex.
Key Points:
- A harness is not merely a quality gate. It is a loop that turns cheap generation into a trustworthy system.
- Separating the specifiable core from the non-specifiable shell clarifies what to automate and what to leave to judgment.
- The right answer is neither pure specification-first nor pure exploration-first. The harness should match your current information state.

Why Does a Harness Exist?
If rules and prompts already exist, why do we need a "harness"?
A developer asks an AI coding agent to fix a bug. The agent edits the code, runs tests, and reports that everything passed. But it forgets to update the specification. Related documentation remains stale. The codebase grows, and the diff that humans must review quickly exceeds their review bandwidth.
The question of this article is simple. What should an AI coding harness guarantee, and what should remain a human decision?
In this article, a harness does not mean only prompts or rule files. It includes what the AI can read, what it can edit, when tests and static analysis run, when security scans happen, what evidence is recorded, and how unused generated work is removed. It is the operational control layer outside the model. This article treats the design and operation of that control layer as harness engineering.
Projects such as ECC provide a concrete example of this direction: they wrap agent operation with skills, hooks, rules, MCP configuration, and related harness components [1]. The important point here is not the specific tool. It is the idea of surrounding AI generation, editing, execution, verification, and evidence capture with mechanisms outside the model.
If humans could keep every AI-written diff fully transparent in their heads, the harness would matter less. In practice, generation throughput quickly outruns human review capacity. The missing review coverage has to be replaced not by more heroic attention, but by inspection, permission control, evidence, and removal mechanisms.
Core and Shell: What Can and Cannot Be Specified
If 100 requirements pass through guardrails, is quality guaranteed?
Consider a payment application. "The balance must not go negative", "the same request ID must not charge twice", and "PII must not appear in logs" are measurable properties. Checking them is like putting a thermometer into a system: the reading does not depend on opinion. Conformance to the written contract can be judged mechanically, without subjective review. In this zone, you do not even need to trust whoever wrote the code.
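As a rough illustration, the three payment properties above can each be written as a mechanical check. This is a minimal sketch, not a production design: `Ledger`, `ContractViolation`, `log_line_is_clean`, and the card-number pattern are hypothetical names and assumptions introduced here, and the PII pattern is deliberately narrow.

```python
import re
from dataclasses import dataclass, field


class ContractViolation(Exception):
    """Raised when a core invariant would be broken."""


@dataclass
class Ledger:
    balance: int  # minor currency units, for example cents
    seen_request_ids: set = field(default_factory=set)

    def charge(self, request_id: str, amount: int) -> bool:
        """Apply a charge once per request ID; return False for a replay."""
        if request_id in self.seen_request_ids:
            return False  # idempotent: a repeated request does not charge again
        if amount < 0:
            # Assumed extra rule for this sketch: charges are never negative.
            raise ContractViolation("amount must not be negative")
        if self.balance - amount < 0:
            raise ContractViolation("balance must not go negative")
        self.balance -= amount
        self.seen_request_ids.add(request_id)
        return True


# Hypothetical, deliberately narrow PII pattern: 13 to 16 digit card-like numbers.
CARD_LIKE = re.compile(r"\b\d{13,16}\b")


def log_line_is_clean(line: str) -> bool:
    """Return True when the log line contains no card-like number."""
    return CARD_LIKE.search(line) is None
```

Each check gives the same verdict no matter who or what wrote the calling code, which is the sense in which the core does not depend on trusting the author.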
Other properties are different. "Is this checkout UI usable?" and "can this survive unknown attacks?" cannot be measured by one instrument. UX tests and threat modeling are useful sensors, but they do not produce a single definitive reading. Testing can show the presence of bugs, but it cannot prove their absence [2].
This split reveals how a harness works.
- Core: Properties that can be specified and mechanically checked, such as ledger invariants, typed interfaces, and conformance tests.
- Shell: Properties that cannot be reduced to a finite rule set, such as usability, alignment with intent, and resistance to unknown threats.
Anything omitted from the contract falls outside the guarantee and remains in the shell. In the core, line-by-line implementation review can shrink, but judgment is still needed on the validity of the specification, risk acceptance, and production approval. In the shell, the share of judgment is larger still. The design skill is deciding how much of the shell can be promoted into the core.
Determinism and Correctness Are Different Axes
Does "same input, same output" guarantee quality?
No. A function that returns the same wrong answer 100 times is perfectly deterministic and still useless. Determinism helps auditability, debugging, and trust. It is not the same thing as correctness.
The center of quality is not whether the generator is deterministic. It is whether the contract is clear and testable. If I/O, types, invariants, permission constraints, and forbidden behavior are concrete enough, any generated diff can be judged against that contract. If the contract is vague, making the generator deterministic does not close the holes.
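The two axes can be separated in a few lines. The sketch below is illustrative only: `deterministic_but_wrong`, `is_deterministic`, `satisfies_contract`, and `ADD_CONTRACT` are hypothetical names, and the "contract" is reduced to a list of input/output cases.

```python
def deterministic_but_wrong(a: int, b: int) -> int:
    """Always returns the same output for the same input, and is still wrong."""
    return a - b  # bug: the contract below asks for a sum


def is_deterministic(fn, args, runs: int = 100) -> bool:
    """Same input, same output across repeated calls."""
    return len({fn(*args) for _ in range(runs)}) == 1


def satisfies_contract(fn, cases) -> bool:
    """Judge any implementation against explicit (args, expected) cases."""
    return all(fn(*args) == expected for args, expected in cases)


# The contract: explicit, testable expectations for an addition function.
ADD_CONTRACT = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
```

Here `is_deterministic(deterministic_but_wrong, (2, 3))` holds while `satisfies_contract(deterministic_but_wrong, ADD_CONTRACT)` does not. The contract check never asks how the function was produced, so it applies equally to hand-written and generated code.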
So the common instinct that "probabilistic generation cannot be trusted" misses the main issue. The risky operation is not probabilistic generation itself. The risky operation is accepting generated output without a testable contract. In the core, gates absorb the variability of generation. In the shell, human judgment and real-world feedback remain necessary.
Reviews Are Sensors, Not Just Gates
If the guardrails are good enough, can review disappear?
Line-by-line review can shrink in the core. It does not disappear in the shell. At least in the shell, review is not merely a pass/fail gate. It is a sensor that detects defects in the specification itself. The reason is simple: conformance to a specification can be verified, but the adequacy of the specification cannot be verified by the specification alone.
Imagine a requirement that says "checkout completes in three taps". The gate passes. But in real use, SMS delivery for OTP is slow, there is no fallback, and users abandon the flow. This fact does not appear by staring at the requirement. It appears only when the product is put in front of reality.
Output review, then, is a sensor for specification defects. Even if the fix eventually lands in the specification, the defect must first be observed through output, usage, incidents, experiments, or review. This sensor is not a temporary workaround. It is the starting point of the loop that grows the core.
The loop looks like this: put output into reality -> surface a specification defect -> promote it to a regression test or rule -> reject it automatically next time -> the core grows and the shell shrinks.
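One way to picture the promotion step is a gate whose rule set grows each time reality exposes a defect. The following is a minimal sketch under assumptions: `CoreGate`, the rule names, and the `taps` / `otp_fallback` fields are hypothetical and only mirror the checkout example above.

```python
from typing import Callable, Dict, List

Check = Callable[[dict], bool]  # returns True when the candidate passes


class CoreGate:
    """A growing set of named checks applied to every candidate output."""

    def __init__(self) -> None:
        self.checks: Dict[str, Check] = {}

    def promote(self, name: str, check: Check) -> None:
        """Turn an observed defect into a rule that is enforced from now on."""
        self.checks[name] = check

    def failures(self, candidate: dict) -> List[str]:
        """Names of all checks the candidate fails, in promotion order."""
        return [name for name, check in self.checks.items() if not check(candidate)]


gate = CoreGate()
gate.promote("completes_in_three_taps", lambda c: c["taps"] <= 3)

# The original requirement passes for this flow.
flow = {"taps": 3, "otp_fallback": False}
before = gate.failures(flow)

# Real use surfaces the missing OTP fallback, so it is promoted into the core.
gate.promote("otp_has_fallback", lambda c: c.get("otp_fallback", False))
after = gate.failures(flow)
```

The same flow that passed before the promotion is rejected afterwards, without anyone having to remember the incident.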
Some properties, such as unknown attacks or latent user preferences, cannot be fully specified in advance. For those, review remains permanently useful.
Specification-First Is Not Always Right: Exploration and Utilization
Is it always better to lock the specification first?
No. Specification-first and exploration-first are two modes of applying the core/shell distinction to process.
Ledger invariants in a payment application can be written before the system is built. Specification-first works well there. But the winning checkout UI cannot be fully known in advance [3]. For that zone, cheap exploration is useful.
You might generate 30 UI prototypes, run the same task-completion test and lightweight user test on all of them, and then select the option with the strongest signal. The exact number 30 is not the point. The point is that lower generation cost allows broader exploration. The POC phase uses reality itself as the sensor.
The shape of the harness changes as work moves from exploration to utilization. The prototype phase does not need the same correctness gates as production. It needs different controls.
- Containment: POCs run in fake payment sandboxes, never touch real cards or PII, and can be discarded automatically.
- Consistent measurement: The same tasks and surveys are applied to every prototype so comparison is fair.
- Diversity preservation: The prototypes should not collapse into minor variations of one local optimum.
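Consistent measurement and diversity preservation can be sketched as follows; containment is an environment concern and is not shown. Everything here is an assumption made for illustration: the `Prototype` type, the task names, and the use of a coarse `layout` label as a crude diversity signal.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Prototype:
    name: str
    layout: str  # coarse design family, used as a crude diversity signal
    run_task: Callable[[str], bool]  # True if the task was completed


# The same fixed task list is applied to every prototype.
TASKS = ["add_item", "enter_address", "pay"]


def completion_rate(proto: Prototype, tasks: List[str] = TASKS) -> float:
    """Fraction of the shared tasks that this prototype completes."""
    return sum(proto.run_task(t) for t in tasks) / len(tasks)


def rank(prototypes: List[Prototype]) -> List[Tuple[float, str]]:
    """Score all prototypes on identical tasks; best first, ties by name."""
    scored = [(completion_rate(p), p.name) for p in prototypes]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


def distinct_layouts(prototypes: List[Prototype]) -> int:
    """Count design families; a low count suggests collapsed exploration."""
    return len({p.layout for p in prototypes})
```

Because every prototype is scored against the same `TASKS`, the ranking compares designs rather than test conditions. A real harness would likely add user-survey signals and a richer diversity measure than a single label.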
Only after a winner is selected do core guardrails such as idempotency, ledger invariants, PCI DSS, and PII controls become mandatory. Applying late-stage harnesses too early kills speed. Keeping early-stage looseness too late ships money-moving code without enough protection.
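The phase-dependent shape can be expressed as a small lookup from phase to mandatory controls. This is a sketch with hypothetical control names; the actual set of controls depends on the project and its compliance scope.

```python
from enum import Enum


class Phase(Enum):
    EXPLORATION = "exploration"
    UTILIZATION = "utilization"


# Hypothetical control names, grouped by the phase in which they are mandatory.
REQUIRED_CONTROLS = {
    Phase.EXPLORATION: frozenset({"sandbox_only", "no_real_pii", "auto_discard"}),
    Phase.UTILIZATION: frozenset(
        {"idempotency_tests", "ledger_invariants", "pci_dss_policy", "pii_log_scan"}
    ),
}


def missing_controls(phase: Phase, enabled: set) -> set:
    """Controls required for this phase that are not currently enabled."""
    return set(REQUIRED_CONTROLS[phase] - enabled)
```

A promotion step from prototype to production could refuse to proceed while `missing_controls(Phase.UTILIZATION, enabled)` is non-empty, which makes the transition explicit instead of gradual.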
Vibe coding is one expression of this early-stage mode. It minimizes the harness in order to move quickly through conversation and intuition rather than detailed prior specification. That can fit disposable prototypes. It is dangerous when carried unchanged into production.
When generation becomes cheap, the next bottleneck is not generation. It is disposal. Moving from exploration to utilization requires not only choosing the winner, but deleting the losers. AI can cheaply create unused functions, duplicated adapters, stale feature flags, excessive fallbacks, and dead code. A harness therefore needs not only mechanisms that encourage generation, but also pressure that removes generated work that should no longer exist. Dead-code scans, unused-dependency checks, and cleanup gates belong in the harness.
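A dead-code scan does not have to be elaborate to be useful as a cleanup gate. The sketch below uses Python's standard `ast` module to flag top-level functions that are never referenced in the same source text. It is a single-file heuristic under stated assumptions: `unused_functions` is a hypothetical helper, and it cannot see cross-module imports, dynamic lookups, or entry points registered elsewhere.

```python
import ast


def unused_functions(source: str) -> list:
    """Return names of top-level functions never referenced in the same source."""
    tree = ast.parse(source)
    defined = {
        node.name for node in tree.body if isinstance(node, ast.FunctionDef)
    }
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)  # plain references such as used()
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)  # attribute references such as mod.used
    return sorted(defined - referenced)


SAMPLE = '''
def used():
    return 1

def legacy_adapter():
    return 2

print(used())
'''

candidates = unused_functions(SAMPLE)
```

The result is a list of removal candidates, not a deletion order. Deciding what actually goes remains a human call, with the scan supplying the pressure.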
A Harness Should Follow the Information State¶
So what is a harness, finally?
A harness is not merely a mechanism that enforces specifications. It is a loop that safely transforms cheap generation into a trustworthy system, and its shape should follow your current information state.
In the specifiable core, guardrails enforce contracts and line-by-line review shrinks. In the non-specifiable shell, exploration is made cheap and safe, reality acts as the sensor, and discovered defects are promoted into the core. Specification-first and POC-first are not moral positions. They are modes to choose by zone. As the system matures, parts of the shell move into the core.
Evidence capture is part of the same loop. If humans cannot read every diff, the system should at least record what the AI read, what it changed, what it executed, and what remains unchecked. This is less like a raw audit log and more like compressing an unreviewable mass of diffs into a reviewable change ledger. CI checks that keep specifications, tests, and documentation synchronized belong to the same family.
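A change ledger entry could look like the record below. The field names follow the four items listed above (what was read, changed, executed, and left unchecked); the `ChangeLedgerEntry` type and its methods are assumptions for illustration, not the format of any particular tool.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ChangeLedgerEntry:
    task: str
    files_read: List[str] = field(default_factory=list)
    files_changed: List[str] = field(default_factory=list)
    commands_run: List[str] = field(default_factory=list)
    unchecked: List[str] = field(default_factory=list)

    def needs_human_attention(self) -> bool:
        """Anything left unchecked is surfaced instead of silently dropped."""
        return bool(self.unchecked)

    def to_json(self) -> str:
        """Serialize the entry so it can be attached to a PR or CI summary."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


entry = ChangeLedgerEntry(
    task="fix rounding bug in refund calculation",
    files_read=["refund.py", "docs/refund-spec.md"],
    files_changed=["refund.py"],
    commands_run=["pytest tests/test_refund.py"],
    unchecked=["docs/refund-spec.md not updated"],
)
```

The `unchecked` field is the part a reviewer reads first. In this example it records that the specification document was read but not updated, which is exactly the kind of gap that a passing test run hides.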
The answer to "is specification-driven development always right?" is no. The right move is to make the harness fit the information state. The center of the discussion shifts from "how do we build quality gates?" to "how do we manage exploration and utilization?"
| Zone | Examples | Harness Shape | Human Role |
|---|---|---|---|
| Core | Idempotency, types, DB constraints, PCI DSS / PII log prohibition | test / typecheck / static analysis / policy gate | Approve specs and risks |
| Shell | UX, unknown attacks, user preference | POC, observation, review, experiment | Judge, discover, and specify |
| Shell -> Core | Discovered bugs, recurrence patterns | regression test / rule / hook / CI | Convert defects into testable form |
| After Exploration | Rejected POCs, duplicated implementation | dead code scan / cleanup gate | Decide what remains and what is deleted |
| Evidence | Files read, change reasons, checks run, unchecked items | Change Ledger / audit log / PR template / CI summary | Review the information needed for judgment |
Summary
Calling an AI harness only a "quality gate" hides the design of exploration and disposal. In the core, the system should increase testable contracts. In the shell, it should safely receive feedback from reality. Neither side is enough by itself.
The important task is not merely generating more. It is deciding when generated work is inspected, which defects are promoted into the core, and which prototypes are removed. The operating quality of AI coding depends on whether that loop is designed.
Related Articles
- What Is Harness Engineering: The New Concept That Defines the "Outside" of Context Engineering
- What Is Harness Engineering: What Do You Actually Build?
- AI Review Automation with Claude Code and Codex: Designing a Double Review Loop
[1] ECC (Everything Claude Code) is an agent harness project for Claude Code, Codex, Cursor, and related tools. It provides components such as skills, hooks, rules, and MCP configuration. The exact component count changes by version, so this article does not treat it as a fixed number.
[2] E. W. Dijkstra, Notes on Structured Programming (EWD249). https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD249/EWD249.html Cited for Dijkstra's point that testing can show the presence of bugs, not their absence.
[3] F. P. Brooks, No Silver Bullet: Essence and Accidents of Software Engineering (1986). Cited for the difficulty of fully fixing requirements in advance and for rapid prototyping as part of requirements formation.