AI Harness

By Abhi Panseriya, Fullstack Engineer at Carousell

Published 30 May 2026 · 4 min read


In Plain Words

A language model on its own does one thing: it reads text and predicts more text. It cannot open a file, run a command, call an API, or check whether its last answer was correct. An AI harness is the code that gives it those abilities. It takes the model's output, acts on it in the real world, feeds the result back, and asks the model what to do next, over and over, until the job is finished.

That loop is the whole idea. The community shorthand is Agent = Model + Harness: the model supplies the reasoning, the harness supplies everything else. Tools like Claude Code, OpenAI Codex, and other coding CLIs are commonly described as harnesses, and this layer is usually the part that decides whether an agent finishes a task or stalls halfway through.

How It Works

A harness wraps the model in a control loop. Each turn through that loop runs roughly the same steps:

1. Assemble context. Build the prompt the model will see: the system instructions, the task, recent tool results, and any retrieved memory. When the context window fills, the harness summarizes or trims it.
2. Call the model. Send that context and read back the response, which is either plain text or a request to use a tool.
3. Execute tools. If the model asked to run a command, read a file, or query an API, the harness runs it and captures the output, including errors.
4. Feed results back. Append the tool output to the context so the next call can react to what actually happened.
5. Decide whether to stop. Check whether the task is complete, the budget is spent, or a guardrail tripped. If not, loop again.
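The first step, assembling context under a size limit, can be sketched as follows. This is a minimal illustration rather than how any particular harness does it: `Message`, `estimate_tokens`, and `assemble_context` are hypothetical names, and the four-characters-per-token estimate is a crude assumption standing in for a real tokenizer. It trims by dropping the oldest history; a production harness might summarize instead.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token. A real harness
    # would call the model provider's tokenizer instead.
    return max(1, len(text) // 4)


def assemble_context(system: Message, task: Message,
                     history: list[Message], budget: int) -> list[Message]:
    """Always keep the system prompt and the task, then add as many of
    the newest history messages as fit inside the token budget."""
    used = estimate_tokens(system.content) + estimate_tokens(task.content)
    kept: list[Message] = []
    for message in reversed(history):      # walk newest to oldest
        cost = estimate_tokens(message.content)
        if used + cost > budget:
            break                          # everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()                         # restore chronological order
    return [system, task, *kept]
```

Keeping the system prompt and task unconditionally reflects a common design choice: the model can recover from losing an old tool result, but not from losing its instructions.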

Stripped to its core, a harness is a short loop around a single model call:

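def run_agent(task, model, max_turns=20):
    context = build_context(task)
    reply = None
    for _ in range(max_turns):                     # budget guard
        reply = model.generate(context)            # the model reasons
        if reply.tool_calls:
            results = run_tools(reply.tool_calls)  # the harness acts
            context = append(context, reply, results)
        elif is_complete(reply):                   # the harness decides
            break
        else:
            # not finished: keep the reply so the next turn can build on it
            context = append(context, reply, [])
    return reply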
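To make the skeleton concrete, here is a self-contained toy version that runs without any model API. Everything in it is an assumption made for illustration: `ScriptedModel` is a hypothetical stand-in that replays canned replies, `Reply` is an invented response shape, and `word_count` is a made-up tool. A real harness would swap the scripted model for an actual client.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    text: str = ""
    # Each tool call is (tool name, argument); an invented, simplified shape.
    tool_calls: list[tuple[str, str]] = field(default_factory=list)


class ScriptedModel:
    """Hypothetical stand-in for an LLM client: replays canned replies."""

    def __init__(self, replies):
        self._replies = iter(replies)

    def generate(self, context):
        return next(self._replies)


TOOLS = {"word_count": lambda text: str(len(text.split()))}


def run_tools(tool_calls):
    results = []
    for name, argument in tool_calls:
        try:
            results.append(f"{name} -> {TOOLS[name](argument)}")
        except Exception as error:         # capture errors instead of raising
            results.append(f"{name} failed: {error!r}")
    return results


def run_agent(task, model, max_turns=5):
    context = [f"task: {task}"]
    reply = Reply()
    for _ in range(max_turns):             # turn budget
        reply = model.generate(context)    # the model reasons
        if not reply.tool_calls:
            break                          # plain text is treated as the final answer
        context.append(f"assistant requested: {reply.tool_calls}")
        context.extend(run_tools(reply.tool_calls))   # the harness acts
    return reply, context


model = ScriptedModel([
    Reply(tool_calls=[("word_count", "the quick brown fox")]),
    Reply(text="The sentence has 4 words."),
])
final, transcript = run_agent("count the words", model)
print(final.text)
```

The transcript ends up holding the task, the tool request, and the tool result, which is exactly the context the second model call would see.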

Real harnesses add more around this skeleton: permission prompts before risky commands, retries when a tool fails, self-critique steps, sub-agents, and verification gates such as running the test suite before declaring success. The pattern of reasoning, acting, and observing in a loop is often called ReAct, and tool use itself rides on a model's structured tool-calling interface. How those tools get described and connected to the model is increasingly handled by shared protocols like the Model Context Protocol.
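Two of those additions, a permission prompt and retries, can be sketched as a wrapper around tool execution. This is an illustrative sketch under stated assumptions: `guarded_run`, `needs_permission`, and the `RISKY_PREFIXES` list are hypothetical, and matching on command prefixes is a deliberately naive policy that a real harness would replace with something stricter.

```python
import time
from typing import Callable

# Illustrative only: a real policy would be far more careful than prefix matching.
RISKY_PREFIXES = ("rm ", "sudo ", "git push")


def needs_permission(command: str) -> bool:
    return command.strip().startswith(RISKY_PREFIXES)


def guarded_run(command: str,
                execute: Callable[[str], str],
                approve: Callable[[str], bool],
                retries: int = 2,
                delay: float = 0.0) -> str:
    """Run one tool call behind a permission gate, retrying on failure.

    Always returns a string so the result can be fed back to the model,
    whether the call succeeded, was denied, or kept failing."""
    if needs_permission(command) and not approve(command):
        return "denied: user declined to run this command"
    last_error = None
    for attempt in range(retries + 1):
        try:
            return execute(command)
        except Exception as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay)          # back off before the next try
    return f"failed after {retries + 1} attempts: {last_error}"
```

Returning a message instead of raising is the point of the design: a denial or a repeated failure becomes one more observation the model can react to on its next turn.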

Why It Matters

The harness compensates for what a raw model cannot do. A model that writes code but never runs it has no way to notice a syntax error or a failing test. Wrap that same model in a harness that runs the code, reads the stack trace, and lets it try again, and it finishes tasks the bare model would fail outright.
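The run, read the error, try again cycle can be sketched with the standard library alone. In this sketch the list of `drafts` stands in for successive model outputs, and `run_python` and `solve` are hypothetical helper names. Note the assumption that running the code directly is acceptable: there is no sandboxing here, which a real harness would need before executing model-written code.

```python
import subprocess
import sys


def run_python(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Execute a snippet in a fresh interpreter; return (succeeded, output)."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    return completed.returncode == 0, completed.stdout + completed.stderr


def solve(attempts: list[str]) -> tuple[str, list[str]]:
    """Try each draft in turn. `feedback` collects the error output that a
    harness would append to the prompt before asking for the next draft."""
    feedback: list[str] = []
    for source in attempts:
        ok, output = run_python(source)
        if ok:
            return output, feedback
        feedback.append(output)
    raise RuntimeError("no attempt succeeded")


# The first draft has a syntax error; the second is the corrected version.
drafts = ["print(1 +)", "print(1 + 2)"]
result, feedback = solve(drafts)
```

Without the loop, the first draft would simply be returned as the answer; with it, the syntax error surfaces as feedback and the second draft succeeds.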

This is why two products built on the same model can perform very differently. On agentic coding benchmarks such as Terminal-Bench, the same model's score can swing by tens of percentage points depending on which harness wraps it, because the harness controls what context the model sees, which tools it can reach, and how many chances it gets to self-correct. Designing that layer well has become its own discipline, sometimes called context engineering or harness engineering.

If you build with agents, the harness is also where your real engineering effort goes. Picking a model is a one-line decision. Choosing which tools to expose, how to manage memory across a long session, when to ask the user for permission, and how to verify output is the work, and it is the part you can actually improve. The shift shows up across the stack, from provider SDKs that standardize tool calling to platform vendors turning agents into first-class API products.

Origin

The word borrows directly from "test harness" in software engineering: the scaffolding that wraps a unit under test to feed it inputs, capture outputs, and make its behavior repeatable. An AI harness plays the same role for a model, feeding it context and capturing its actions. The term spread quickly through 2024 and 2025 because it named something teams were already building, a model plus its loop and tools, without having agreed on a word for it.

Harness, Scaffold, and Model

These three terms get used loosely, sometimes interchangeably. The useful distinction is what each one is responsible for.

Layer	What it is	Responsible for
Model	The language model itself.	Reasoning and generating text or tool-call requests. Nothing else.
Scaffold	What the model sees.	System prompt, tool descriptions, output parsing, what gets remembered each step.
Harness	What runs the loop.	Calling the model, executing tools, managing context, deciding when to stop.

In casual use, "harness" often covers the scaffold too. When someone says Claude Code or Codex is "a harness," they mean everything that is not the model. The narrower split above is worth knowing when a discussion turns to who owns which behavior.
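One way to picture the narrower split is as two separate pieces of code. The sketch below is an assumption-laden illustration, not a standard interface: the `Scaffold` class, the `harness` function, and the convention that a tool request is a JSON object with a `tool` key are all invented for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class Scaffold:
    """What the model sees, and how its raw output is interpreted."""
    system_prompt: str
    tool_descriptions: dict[str, str]

    def render(self, task: str, observations: list[str]) -> str:
        tools = "\n".join(f"- {name}: {desc}"
                          for name, desc in self.tool_descriptions.items())
        return "\n".join([self.system_prompt, "Tools:", tools,
                          f"Task: {task}", *observations])

    def parse(self, raw: str) -> dict:
        # Assumed convention: a JSON object with a "tool" key is a tool
        # request; anything else is treated as the final answer.
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {"final": raw}
        if isinstance(parsed, dict) and "tool" in parsed:
            return parsed
        return {"final": raw}


def harness(scaffold, model, tools, task, max_turns=5):
    """What runs the loop: calls the model, executes tools, decides when to stop."""
    observations: list[str] = []
    for _ in range(max_turns):
        action = scaffold.parse(model(scaffold.render(task, observations)))
        if "final" in action:
            return action["final"]
        result = tools[action["tool"]](action.get("input", ""))
        observations.append(str(result))
    return "stopped: turn budget spent"
```

Changing the prompt wording or the parsing rule touches only `Scaffold`; changing the turn budget or how tools execute touches only `harness`, which is the ownership question the distinction is meant to answer.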

A harness is also not the same as an agent framework. A framework like LangGraph or one of the agent SDKs is a toolkit for building a harness. The harness is the specific running loop you end up with. You can write one from scratch in a few dozen lines, or assemble it from a framework.

Frequently Asked Questions

What is an AI harness?

An AI harness is the code that wraps a language model and turns it into an agent. It runs the model in a loop, executes the tools the model requests, manages the context window, and decides when the task is finished.

What does Agent = Model + Harness mean?

It is the common shorthand for how agents are built: the model supplies the reasoning, and the harness supplies everything else, including the loop, tool execution, memory, and stopping logic. The harness is everything in an agent that is not the model.

What is the difference between a harness and a scaffold?

The scaffold is what the model sees: the system prompt, tool descriptions, and parsing. The harness is what runs the loop: calling the model, executing tools, managing context, and deciding when to stop. In casual use, harness often covers both.

Why does the harness sometimes matter more than the model?

Because the harness controls what context the model sees, which tools it can reach, and how many times it can self-correct. On benchmarks like Terminal-Bench, the same model can score very differently depending on which harness wraps it.

Is an AI harness the same as an agent framework?

No. A framework such as LangGraph or an agent SDK is a toolkit for building a harness. The harness is the specific running loop you end up with, which you can also write from scratch in a few dozen lines.



