In Plain Words

A language model on its own does one thing: it reads text and predicts more text. It cannot open a file, run a command, call an API, or check whether its last answer was correct. An AI harness is the code that gives it those abilities. It takes the model's output, acts on it in the real world, feeds the result back, and asks the model what to do next, over and over, until the job is finished.

That loop is the whole idea. The community shorthand is Agent = Model + Harness: the model supplies the reasoning, the harness supplies everything else. In tools like Claude Code, OpenAI Codex, and the various coding CLIs, this surrounding layer is what people call the harness, and it is usually the part that decides whether an agent finishes a task or stalls halfway through.

How It Works

A harness wraps the model in a control loop. Each turn through that loop runs roughly the same steps:

  • Assemble context. Build the prompt the model will see: the system instructions, the task, recent tool results, and any retrieved memory. When the context window fills, the harness summarizes or trims it.
  • Call the model. Send that context and read back the response, which is either plain text or a request to use a tool.
  • Execute tools. If the model asked to run a command, read a file, or query an API, the harness runs it and captures the output, including errors.
  • Feed results back. Append the tool output to the context so the next call can react to what actually happened.
  • Decide whether to stop. Check whether the task is complete, the budget is spent, or a guardrail tripped. If not, loop again.
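
The first step, assembling context under a size limit, can be sketched as follows. This is a minimal illustration, not how any particular harness does it: the `Context` class, the `assemble` function, and the four-characters-per-token estimate are all assumptions made for the example, and a real harness would use the model's tokenizer and often summarize rather than drop old results.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    system: str                                        # system instructions, always kept
    task: str                                          # task statement, always kept
    history: list[str] = field(default_factory=list)   # tool results, oldest first


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assemble(ctx: Context, budget: int) -> str:
    """Build the prompt, dropping the oldest tool results until it fits the budget."""
    fixed = [ctx.system, ctx.task]
    used = sum(estimate_tokens(part) for part in fixed)
    kept: list[str] = []
    # Walk history newest-first so the most recent results survive trimming.
    for entry in reversed(ctx.history):
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    dropped = len(ctx.history) - len(kept)
    parts = list(fixed)
    if dropped:
        # The short marker is not counted against the budget in this sketch.
        parts.append(f"[{dropped} earlier tool results omitted]")
    parts.extend(reversed(kept))                       # restore chronological order
    return "\n\n".join(parts)
```

Trimming from the oldest end is only one policy. It keeps the model anchored on what just happened, at the cost of forgetting early observations that may still matter.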

Stripped to its core, a harness is a short loop around a single model call:

# build_context, model, run_tools, append, and is_complete are placeholders.
def run_harness(task):
    context = build_context(task)
    done = False
    while not done:
        reply = model.generate(context)            # the model reasons
        if reply.tool_calls:
            results = run_tools(reply.tool_calls)  # the harness acts
            context = append(context, reply, results)
        else:
            done = is_complete(reply)              # the harness decides
    return reply
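
The skeleton above leaves its helpers undefined. A runnable variant might look like the sketch below, under some loud assumptions: `Reply`, `run_agent`, and `run_tools` are names invented for this example, the "model" is a scripted function standing in for a real API call, and the context is a plain list of strings rather than a structured message history.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Reply:
    text: str = ""
    tool_calls: list[tuple[str, str]] = field(default_factory=list)  # (tool name, argument)


def run_tools(calls: list[tuple[str, str]], tools: dict[str, Callable[[str], str]]) -> list[str]:
    results = []
    for name, arg in calls:
        try:
            results.append(f"{name}({arg!r}) -> {tools[name](arg)}")
        except Exception as exc:
            # Errors are fed back as observations instead of crashing the loop.
            results.append(f"{name}({arg!r}) failed: {type(exc).__name__}: {exc}")
    return results


def run_agent(task: str, generate: Callable[[list[str]], Reply],
              tools: dict[str, Callable[[str], str]], max_steps: int = 5):
    context = [f"task: {task}"]
    for _ in range(max_steps):                 # step budget doubles as a stop condition
        reply = generate(context)              # the model reasons
        if not reply.tool_calls:
            return reply.text, context         # plain text is treated as the final answer
        context.append(f"assistant requested: {reply.tool_calls}")
        context.extend(run_tools(reply.tool_calls, tools))  # the harness acts
    return "stopped: step budget exhausted", context


def scripted_model(context: list[str]) -> Reply:
    # Hypothetical stand-in for a model: reacts only to the last context line.
    last = context[-1]
    if " -> " in last:
        return Reply(text=f"Answer: {last.split(' -> ')[1]}")
    if "failed" in last:
        return Reply(text="I could not complete the task.")
    return Reply(tool_calls=[("word_count", "the quick brown fox")])


tools = {"word_count": lambda s: str(len(s.split()))}
answer, transcript = run_agent("count the words", scripted_model, tools)
print(answer)  # prints: Answer: 4
```

Swapping `scripted_model` for a function that calls a real model is the only change needed to turn this into a working, if very bare, agent.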

Real harnesses add more around this skeleton: permission prompts before risky commands, retries when a tool fails, self-critique steps, sub-agents, and verification gates such as running the test suite before declaring success. The pattern of reasoning, acting, and observing in a loop is often called ReAct, and tool use itself rides on a model's structured tool-calling interface. How those tools get described and connected to the model is increasingly handled by shared protocols like the Model Context Protocol.
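
Two of those additions, a permission prompt and a retry on failure, fit in one small wrapper. The sketch below is illustrative only: `guarded_run`, `needs_permission`, and the `RISKY_PREFIXES` policy are hypothetical names and rules, and `execute` and `approve` are callbacks the caller supplies (for instance a shell runner and a yes/no prompt).

```python
import time
from typing import Callable

# Hypothetical policy: commands starting with these prefixes need user approval.
RISKY_PREFIXES = ("rm ", "sudo ", "git push")


def needs_permission(command: str) -> bool:
    return command.strip().startswith(RISKY_PREFIXES)


def guarded_run(command: str,
                execute: Callable[[str], str],
                approve: Callable[[str], bool],
                retries: int = 2,
                delay: float = 0.0) -> str:
    """Run a command behind a permission gate, retrying if the tool raises."""
    if needs_permission(command) and not approve(command):
        # The refusal is returned as text so the model can see it and adapt.
        return "denied: user declined to run this command"
    last_error = None
    for attempt in range(retries + 1):
        try:
            return execute(command)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)              # simple fixed pause between attempts
    return f"failed after {retries + 1} attempts: {last_error}"
```

Returning denials and failures as strings, rather than raising, keeps them inside the loop as observations the model can respond to.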

Why It Matters

The harness compensates for what a raw model cannot do. A model that writes code but never runs it has no way to notice a syntax error or a failing test. Wrap that same model in a harness that runs the code, reads the stack trace, and lets it try again, and it finishes tasks the bare model would fail outright.
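
A small sketch makes the run, read, retry cycle concrete. It assumes a `propose` callback that plays the model's role (here a scripted function that fixes its own typo once it sees a `NameError`), and it runs the code with the standard library's `subprocess.run`. The function names are invented for this example, and a real harness would execute untrusted code in a sandbox rather than directly on the host.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_python(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run source in a fresh interpreter; return (succeeded, combined output)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def solve_with_feedback(propose: Callable[[str], str],
                        max_attempts: int = 3) -> tuple[Optional[int], str]:
    feedback = ""                               # empty on the first attempt
    for attempt in range(1, max_attempts + 1):
        source = propose(feedback)              # the "model" writes code
        ok, output = run_python(source)         # the harness runs it
        if ok:
            return attempt, output
        feedback = output                       # the stack trace becomes the next input
    return None, feedback


def scripted_propose(feedback: str) -> str:
    # Hypothetical stand-in for a model: corrects itself after seeing the error.
    if "NameError" in feedback:
        return "total = sum([1, 2, 3])\nprint(total)"
    return "print(totl)"                        # first draft contains a typo


attempt, output = solve_with_feedback(scripted_propose)
print(attempt, output.strip())
```

Without the loop, the first draft's typo would simply be the final answer. With it, the error message is enough for a second attempt to succeed.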

This is why two products built on the same model can perform very differently. On agentic coding benchmarks such as Terminal-Bench, one model's score can swing by tens of percentage points depending on which harness wraps it, because the harness controls what context the model sees, which tools it can reach, and how many chances it gets to self-correct. Designing that layer well has become its own discipline, sometimes called context engineering or harness engineering.

If you build with agents, the harness is also where your real engineering effort goes. Picking a model is a one-line decision. The work lies in choosing which tools to expose, how to manage memory across a long session, when to ask the user for permission, and how to verify output, and that is the part you can actually improve. The shift shows up across the stack, from provider SDKs that standardize tool calling to platform vendors turning agents into first-class API products.

Origin

The word borrows directly from "test harness" in software engineering: the scaffolding that wraps a unit under test to feed it inputs, capture outputs, and make its behavior repeatable. An AI harness plays the same role for a model, feeding it context and capturing its actions. The term spread quickly through 2024 and 2025 because it named something teams were already building but had no agreed word for: a model plus its loop and tools.

Harness, Scaffold, and Model

These three terms get used loosely, sometimes interchangeably. The useful distinction is what each one is responsible for.

Layer     What it is                   Responsible for
Model     The language model itself.   Reasoning and generating text or tool-call requests. Nothing else.
Scaffold  What the model sees.         System prompt, tool descriptions, output parsing, what gets remembered each step.
Harness   What runs the loop.          Calling the model, executing tools, managing context, deciding when to stop.

In casual use, "harness" often covers the scaffold too. When someone says Claude Code or Codex is "a harness," they mean everything that is not the model. The narrower split above is worth knowing when a discussion turns to who owns which behavior.
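
One way to picture the narrower split is to give each layer its own piece of code. In the sketch below, everything is an assumption made for illustration: the `Scaffold` class, the `harness` function, the `CALL name argument` text convention for tool requests, and the rule of remembering only the last three observations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scaffold:
    """What the model sees: prompt layout, tool descriptions, parsing, memory policy."""
    system_prompt: str
    tool_descriptions: dict[str, str]

    def render(self, task: str, observations: list[str]) -> str:
        tools = "\n".join(f"- {name}: {desc}" for name, desc in self.tool_descriptions.items())
        seen = "\n".join(observations[-3:])     # only the last three results are remembered
        return f"{self.system_prompt}\nTools:\n{tools}\nTask: {task}\nObservations:\n{seen}"

    def parse(self, raw: str) -> tuple[str, str, str]:
        # Hypothetical convention: "CALL <tool> <argument>" requests a tool.
        if raw.startswith("CALL "):
            parts = raw.split(" ", 2)
            return ("tool", parts[1], parts[2] if len(parts) > 2 else "")
        return ("final", raw, "")


def harness(scaffold: Scaffold, model: Callable[[str], str],
            tools: dict[str, Callable[[str], str]], task: str, max_steps: int = 4) -> str:
    """What runs the loop: calls the model, executes tools, decides when to stop."""
    observations: list[str] = []
    for _ in range(max_steps):
        kind, name, arg = scaffold.parse(model(scaffold.render(task, observations)))
        if kind == "final":
            return name                         # for a final reply, this slot holds the text
        tool = tools.get(name)
        result = tool(arg) if tool else "unknown tool"
        observations.append(f"{name} -> {result}")
    return "stopped: step budget exhausted"


def scripted(prompt: str) -> str:
    # Stand-in for the model layer: text in, text out, nothing else.
    return "The result is HELLO." if "upper -> HELLO" in prompt else "CALL upper hello"


scaffold = Scaffold("You are a helper.", {"upper": "uppercase a string"})
result = harness(scaffold, scripted, {"upper": str.upper}, "shout hello")
```

Changing the prompt layout or the parsing rule touches only `Scaffold`, while changing the step budget or tool execution touches only `harness`, which is the ownership question the table is getting at.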

A harness is also not the same as an agent framework. A framework like LangGraph or one of the agent SDKs is a toolkit for building a harness. The harness is the specific running loop you end up with. You can write one from scratch in a few dozen lines, or assemble it from a framework.