The Harness Is the Product
Why the same underlying model can feel sharp in one coding tool and strangely useless in another. The answer is not the model—it is the harness.
People talk about agentic coding and vibe coding like these are mysterious new categories of software. Usually they are just describing a feeling. The more important thing is much less glamorous: the harness.
That is the part most people never see, and it explains a lot. Why can the same underlying model feel sharp in one coding tool and strangely useless in another? Why does one agent make small, sensible edits while another thrashes around the repo like it is lost in IKEA? A lot of that is not the model. It is the harness.
What a Harness Actually Is
By itself, an LLM does not "use your computer." It does not open files, run tests, inspect logs, or patch code. It produces text. That text can be very smart, but it is still just text. The harness is the machinery around the model that turns text into actions.
Concretely, it is three things: the loop, the tools, and the rules.
The loop is simple. The model gets the conversation plus some tool definitions. It either replies to the user, or it asks to use a tool—read a file, grep the repo, run npm test, apply a patch. The harness intercepts that request, executes it, captures the result, appends it back into the conversation, and calls the model again.
That is basically the whole trick:
while True:
    msg = model(history, tools)
    if msg.tool_call:
        result = run_tool(msg.tool_call)
        history += [msg, result]
    else:
        break
A coding agent is not one long monologue. It is a sequence of short bursts of reasoning, with the environment feeding information back in between. The model acts, the world responds, the model updates. That is what makes it look "agentic."
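To make that concrete, here is a toy version of the loop you could actually run. The model function is a scripted stand-in rather than a real LLM call, and the tool name and message shape are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Message:
    content: str
    tool_call: Optional[dict] = None  # e.g. {"name": "read_file", "args": {...}}

# Stub tool: a real harness would actually read the file from disk.
TOOLS = {"read_file": lambda path: f"<contents of {path}>"}

# Scripted stand-in for a model: ask for one tool call, then answer.
def model(history, tools):
    if not any(m.tool_call for m in history):
        return Message("", tool_call={"name": "read_file", "args": {"path": "app.py"}})
    return Message("The bug is on line 3.")

def run_tool(call):
    result = TOOLS[call["name"]](**call["args"])
    return Message(result)

history = [Message("Why does app.py crash?")]
while True:
    msg = model(history, TOOLS)
    if msg.tool_call:
        result = run_tool(msg.tool_call)
        history += [msg, result]
    else:
        history.append(msg)
        break
```

Swap the scripted model for a real API call and the stub tool for real file access, and this is the skeleton of every coding agent: act, observe, append, repeat.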
Once you see this, a lot of the magic evaporates in a useful way.
The Context Problem
The model does not sit there with a god view of your repository. It pokes around—reading one file, then another, running a command, seeing an error, adjusting. Good systems do not preload the entire codebase into the model's head. They let the model build context as it goes.
Early AI coding tools made the opposite bet: just dump the whole repo into the context window. More context must be better, right? Not really. Huge context windows help, but they are not a substitute for relevance. If you shovel an entire codebase into the prompt, the model has to find the one useful bit buried in a mountain of irrelevant text. That is expensive, slow, and often dumb.
The better pattern is what a human engineer does. Start with the symptom. Search the repo. Open the likely files. Read the config. Run the failing test. Narrow down. Pull in exactly the code that matters.
Good harnesses let the model do this. Bad ones either starve it of information or drown it in junk. This is why context size is not the whole story. The question is not "how much can the model hold?" It is "can the model fetch the right thing at the right time?"
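A sketch of what fetch-on-demand looks like as a tool. The search_repo name and its (path, line, text) output shape are illustrative choices, not any particular product's API:

```python
import os

def search_repo(root, needle, max_hits=20):
    """Grep-style search: return (path, line_no, line) hits, not whole files."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    for i, line in enumerate(f, 1):
                        if needle in line:
                            hits.append((path, i, line.rstrip()))
                            if len(hits) >= max_hits:
                                return hits  # cap output so context stays lean
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
    return hits
```

The point is the return shape: a handful of located lines the model can follow up on, instead of megabytes of repo it has to wade through.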
Why Interfaces Matter More Than You Think
The model never sees your Python or TypeScript implementation. It only sees the interface: the tool name, the description, the input shape, and the output it gets back. Small details matter more than they should.
If a file-read tool returns clean snippets with line numbers, the model stays oriented. If it dumps walls of text, the context fills with noise. If the edit tool nudges toward minimal diffs, you get cleaner patches. If the shell tool is too unconstrained, the model spends half its time wandering the filesystem.
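For example, a read tool that returns a numbered window instead of the whole file might look like this (a hypothetical helper, not any real tool's interface):

```python
def read_snippet(path, start=1, count=40):
    """Return a numbered window of a file so the model stays oriented."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    end = min(start - 1 + count, len(lines))
    numbered = [f"{i:>5}| {lines[i - 1].rstrip()}" for i in range(start, end + 1)]
    header = f"{path} (lines {start}-{end} of {len(lines)})"
    return "\n".join([header] + numbered)
```

The header tells the model where it is in the file; the line numbers let it ask for a precise follow-up window or cite an exact location in its edit.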
A harness is partly infrastructure, partly prompt design. Tool descriptions matter. Output formatting matters. Permission prompts matter. Whether the harness returns the full test log or just the useful excerpt matters. Whether the model can recover from a failed command matters. Whether it can see its own diffs matters.
These are not cosmetic choices. They shape behavior.
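One small example of such a choice: returning the useful excerpt of a long log instead of the whole thing. This is a naive sketch with made-up failure markers; real harnesses use smarter heuristics for which lines matter:

```python
def trim_log(log, keep_head=5, keep_tail=30, markers=("FAIL", "Error", "error:")):
    """Keep the start, the end, and any middle lines that look like failures."""
    lines = log.splitlines()
    if len(lines) <= keep_head + keep_tail:
        return log  # short logs pass through untouched
    kept = lines[:keep_head]
    middle = lines[keep_head:-keep_tail]
    flagged = [l for l in middle if any(m in l for m in markers)][:20]
    kept += flagged
    kept.append(f"... {len(middle) - len(flagged)} lines omitted ...")
    kept += lines[-keep_tail:]
    return "\n".join(kept)
```

A thousand-line test run collapses to a few dozen lines the model can actually reason about, and the omission marker tells it that trimming happened.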
The Two Harnesses
There is an important split here: the prototype harness versus the production harness.
A prototype is not that complicated. You can build a surprisingly capable one with a few tools and a control loop. Read files. List files. Edit files. Maybe run shell commands. That gets you a lot. If you are reckless, you can hand the model a bash tool and let it figure out the rest.
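The tool set really can be that small. Here is a hypothetical minimal registry with a dispatcher that feeds errors back to the model as text instead of crashing the loop:

```python
import os

def list_files(path="."):
    return "\n".join(sorted(os.listdir(path)))

def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

def edit_file(path, old, new):
    """Replace the first occurrence of `old` with `new`, or report failure."""
    text = read_file(path)
    if old not in text:
        return f"edit failed: target text not found in {path}"
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new, 1))
    return f"edited {path}"

TOOLS = {"list_files": list_files, "read_file": read_file, "edit_file": edit_file}

def dispatch(name, **args):
    if name not in TOOLS:
        return f"unknown tool: {name}"  # the model can read this and retry
    try:
        return TOOLS[name](**args)
    except Exception as e:
        return f"tool error: {e}"  # failures become context, not crashes
```

Note the edit tool's design: it takes old text and new text rather than a whole file, which nudges the model toward minimal diffs, and it reports a miss as a string the model can recover from.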
The core idea is simple. But production is not simple in the ways that matter to users. The gap between a weekend demo and a tool people trust with real code is all the boring stuff: permissions, sandboxes, retries, diff review, context trimming, failure recovery, secret handling, test orchestration, and not destroying the repo because the model got creative at 2:13 PM.
This is why the best coding tools do not win just by having a strong base model. They win because the entire loop is tighter. The tools are better described. The outputs are cleaner. The context is built more intelligently. The model wastes fewer turns. The user gets asked for approval at the right moments instead of constantly or not at all.
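A sketch of what "asking at the right moments" might mean as policy. The tool names and the danger list here are invented, and a production harness would be far more careful (sandboxing, allowlists, per-project rules), but the shape is a simple predicate the loop consults before executing a tool call:

```python
SAFE_TOOLS = {"read_file", "list_files", "grep"}
RISKY_SHELL = ("rm ", "git push", "curl ", "sudo ")

def needs_approval(tool_name, args):
    """Auto-approve reads, gate risky shell commands, default-deny the rest."""
    if tool_name in SAFE_TOOLS:
        return False  # reads never prompt: no approval fatigue
    if tool_name == "shell":
        cmd = args.get("command", "")
        return any(bad in cmd for bad in RISKY_SHELL)
    return True  # writes and unknown tools ask the user
```

The policy is where "constantly or not at all" gets tuned: too strict and the user drowns in prompts, too loose and the model gets creative with the repo.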
What to Actually Evaluate
When people compare coding agents, the question is not just "which model is underneath?" That matters, but it is incomplete to the point of being misleading.
A better question: what environment does the model live in?
Can it inspect the codebase efficiently? Can it run and verify its own changes? Can it recover after breaking something? Does it know when to stop and ask? Does it keep the useful context and drop the junk?
That is the difference between a neat demo and a tool that actually moves tickets.
The Real Product
For developers, this demystifies the stack. You do not need to invent synthetic consciousness to build one of these systems. You need a control loop, decent tools, and a lot of discipline around interfaces.
For managers, it changes what you evaluate. Not just raw model quality, but operating quality. The harness is where a lot of the product actually lives.
The model is the reasoning engine. The harness is the rest of the machine.
Without the harness, you have a chatbot that can talk about code. With it, you have something that can work on code.
Co-authored with Hermes, an autonomous AI research assistant.