AGENTS.md vs Agent Skills vs MCP: Context Window Economics

GitHub's official MCP server exposes 93 tools. Before an agent connected to it reads a word of your request, it reads JSON schemas for every one of them. In late 2025, one practitioner measured that overhead at roughly 42,000 tokens. By May 2026, a second measurement put it at 55,000. That is more than a quarter of a 200,000-token context window, spent before any work starts.

That number explains the past year of agent tooling better than any launch post does. Between August and December 2025 the ecosystem formalized three separate standards for feeding agents context: AGENTS.md for project conventions, Agent Skills for procedural knowledge, and the Model Context Protocol for live connectivity. People keep framing them as competitors. They are closer to three answers to one constraint: every token spent on setup is a token the model cannot spend on the actual task.

Three standards, one budget

Every coding agent is a harness wrapped around a model: a loop that decides what enters the context window and when. The three standards slot into that loop at different depths. AGENTS.md tells the agent how this repository works. Skills tell it how to perform a multi-step procedure. MCP hands it tools to act on systems beyond the filesystem.

The split got institutional backing in December 2025, when the Linux Foundation announced the Agentic AI Foundation with MCP from Anthropic, AGENTS.md from OpenAI, and Block's goose as founding projects. The same press release counted more than 10,000 published MCP servers, and the MCP blog cited over 97 million monthly SDK downloads. Google, Microsoft, AWS, and Cloudflare all signed on as platinum members. Whatever rivalry existed at the format level, governance consolidated fast.

The surface area keeps growing too. Google is pushing agent-callable tools into the browser itself with WebMCP, and every major IDE now reads at least one of these formats. So the question for a team in mid-2026 is not which standard wins. It is what belongs in which layer, and what each layer costs.

AGENTS.md: a README for agents, with marginal returns

AGENTS.md is the simplest of the three on purpose. It is plain Markdown with no required fields, placed at the repository root and read before the agent does anything else. In monorepos the spec settles precedence in one line: the closest AGENTS.md to the edited file wins. A subpackage can override repo-wide instructions without any configuration.
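
The nearest-file rule is simple enough to sketch in a few lines. What follows is an illustration in Python, not how any particular agent implements it: the function name nearest_agents_md and the decision to stop searching at the repository root are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def nearest_agents_md(edited_file: Path, repo_root: Path) -> Optional[Path]:
    """Walk upward from the edited file and return the first AGENTS.md found.

    Hypothetical helper: stops at repo_root and returns None if nothing matches.
    """
    repo_root = repo_root.resolve()
    current = edited_file.resolve().parent
    # Ignore files that live outside the repository altogether.
    if current != repo_root and repo_root not in current.parents:
        return None
    while True:
        candidate = current / "AGENTS.md"
        if candidate.is_file():
            return candidate
        if current == repo_root:
            return None
        current = current.parent
```

Given a root AGENTS.md and a second one in packages/api, an edit under packages/api/src resolves to the subpackage file, while an edit under packages/web falls back to the root.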

The file arrived as a consolidation play. By mid-2025, repositories were accumulating .cursorrules, copilot-instructions.md, CLAUDE.md, and GEMINI.md, each proprietary to one tool. AGENTS.md's pitch was one file every agent reads. OpenAI's Codex popularized it, the cross-vendor standard formalized in August 2025 with backing from Google, Cursor, Factory, and Sourcegraph, and the official site now lists Codex, Jules, Gemini CLI, Cursor, GitHub Copilot's coding agent, Devin, Aider, Zed, Warp, and JetBrains Junie among supporting tools. It also counts more than 60,000 open source projects shipping the file.

Claude Code is the notable holdout. Anthropic's docs and changelog still document CLAUDE.md as the project instruction file, and the long-running feature request for native AGENTS.md support remains open as of June 2026. In practice, teams symlink one file to the other and move on.

The catch is that instruction files are the one layer with no lazy loading. The whole file lands in context on every session, which makes its value per token worth auditing. A study covered by InfoQ in March 2026 did exactly that: 138 real tasks, four agents, three conditions. Human-written context files improved success rates by an average of 4 percent while raising inference costs by up to 19 percent. LLM-generated ones made agents slightly worse, cutting success by about 3 percent while adding more than 20 percent in cost.

The recommendation that falls out matches what experienced users converged on anyway. Keep the file short, write it by hand, and include only what the agent cannot infer from the code: the build command with the non-obvious flag, the internal URL convention, the test suite that actually gates CI. An agent can read your tsconfig by itself.

Agent Skills: procedures that load in three stages

Anthropic shipped Agent Skills in October 2025 and published the format as an open standard that December. A skill is a directory containing a SKILL.md file with YAML frontmatter on top, plus optional scripts and reference files alongside. Two frontmatter fields are required: a name up to 64 characters that must match the directory, and a description up to 1,024 characters.

---
name: deploy-checklist
description: Pre-deploy checklist for the payments service. Use when asked to ship, deploy, or release payments changes.
---

# Deploy checklist

1. Run the contract tests against staging.
2. Verify the migration is reversible.
3. Page the on-call before toggling the feature flag.
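
The two required fields and their limits are easy to check before a skill ships. The sketch below is one way to do it in Python with the standard library only. The function names are hypothetical, and the frontmatter parser is deliberately naive: it reads flat key: value pairs and is not a real YAML parser.

```python
from pathlib import Path

MAX_NAME = 64            # name limit described above
MAX_DESCRIPTION = 1024   # description limit described above


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read flat `key: value` pairs between the two leading '---' lines.

    Naive on purpose: no nesting, quoting, or multi-line values.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # the closing '---' was never found


def validate_skill(skill_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    skill_file = skill_dir / "SKILL.md"
    if not skill_file.is_file():
        return ["SKILL.md is missing"]
    fields = parse_frontmatter(skill_file.read_text(encoding="utf-8"))
    problems: list[str] = []

    name = fields.get("name", "")
    if not name:
        problems.append("name is required")
    elif len(name) > MAX_NAME:
        problems.append(f"name exceeds {MAX_NAME} characters")
    elif name != skill_dir.name:
        problems.append(f"name '{name}' does not match directory '{skill_dir.name}'")

    description = fields.get("description", "")
    if not description:
        problems.append("description is required")
    elif len(description) > MAX_DESCRIPTION:
        problems.append(f"description exceeds {MAX_DESCRIPTION} characters")
    return problems
```

Run against a deploy-checklist directory holding the file above, validate_skill returns an empty list. Rename the directory and it reports the mismatch.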

The interesting part is what loads when. Anthropic's docs define three levels of progressive disclosure:

Level	When it loads	Cost
Metadata (name and description)	Always, at startup	About 100 tokens per skill
Instructions (SKILL.md body)	When the skill triggers	Under 5,000 tokens
Bundled resources and scripts	As needed, via execution	Effectively unlimited

Level three is the trick that separates skills from a folder of prompts. When the agent runs a bundled script, the script's code never enters the context window. Only its output does. A skill can carry a 2,000-line validation script and pay nothing for it until the moment it runs, and even then only for the result.
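
A rough way to picture level three: the harness runs the script as a subprocess and passes along only what it prints. The Python sketch below assumes that model. The helper name run_bundled_script is hypothetical, and real harnesses add sandboxing and error handling that this omits.

```python
import subprocess
import sys
from pathlib import Path


def run_bundled_script(script: Path, *args: str) -> str:
    """Run a skill's bundled script and return only what it printed.

    The script's source is never read into the returned text, which is
    the point: only stdout would be handed back to the model.
    """
    result = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True,
        text=True,
        check=True,   # raise if the script exits with a non-zero status
        timeout=60,
    )
    return result.stdout.strip()
```

A validation script thousands of lines long that prints a single summary line costs the context window that one line, however large the source file is.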

The execution environment is where the portability story gets complicated. Skills on the Claude API run with no network access and no runtime package installation. The same skill in Claude Code runs with whatever access your laptop has. A skill that shells out to npm works in one environment and fails in the other, and the format has no way to declare the difference beyond a free-text compatibility field.

Adoption followed the AGENTS.md playbook. The agentskills.io showcase lists OpenAI's Codex, Google's Gemini CLI, Cursor, GitHub Copilot, VS Code, JetBrains Junie, Block's goose, and AWS Kiro among clients reading the same SKILL.md format. For once, the ecosystem converged on a format before fragmenting it.

MCP: the bill arrived in 2026

MCP solved a real problem: a standard way to hand agents live tools instead of copy-pasted context. The cost model was an afterthought. A connected server lists its tools, and every tool ships its name, description, and full JSON schema into the context window up front, whether or not the conversation ever touches it.

MCP's lead maintainer, David Soria Parra, described the failure mode plainly in an April 2026 interview: "Tools come with metadata: descriptions, parameters, schemas. Across dozens of integrations, a significant portion of the context window is consumed before the model does any actual reasoning."

The waste is structural, not just volumetric. SEP-1576, a proposal opened in September 2025 to mitigate token bloat, analyzed the GitHub server's schemas and found the owner parameter repeated in 60 percent of tool definitions and repo in 65 percent. Every repetition bills again.
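
That kind of repetition is straightforward to measure for any server. The sketch below counts how often each parameter name appears across a list of tool definitions, assuming each definition carries its JSON Schema under inputSchema as MCP tool listings do. The three sample tools are invented for illustration and are not the GitHub server's real schemas.

```python
from collections import Counter


def parameter_repetition(tools: list[dict]) -> dict[str, float]:
    """Map each parameter name to the fraction of tools that declare it."""
    if not tools:
        return {}
    counts: Counter = Counter()
    for tool in tools:
        properties = tool.get("inputSchema", {}).get("properties", {})
        counts.update(properties.keys())
    return {name: count / len(tools) for name, count in counts.items()}


# Hypothetical tool definitions, shaped like entries in a tool listing.
SAMPLE_TOOLS = [
    {"name": "create_issue",
     "inputSchema": {"type": "object",
                     "properties": {"owner": {}, "repo": {}, "title": {}}}},
    {"name": "list_commits",
     "inputSchema": {"type": "object",
                     "properties": {"owner": {}, "repo": {}}}},
    {"name": "search_users",
     "inputSchema": {"type": "object",
                     "properties": {"query": {}}}},
]

if __name__ == "__main__":
    for name, share in sorted(parameter_repetition(SAMPLE_TOOLS).items()):
        print(f"{name}: {share:.0%} of tools")
```

Any parameter that shows up in most tools is a candidate for the kind of deduplication the proposal is after.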

Bloat degrades quality, not just cost. Speakeasy ran a controlled experiment scaling a pet store API from 10 tools to 107. At 10 tools, models scored perfectly. At 20, large models got 19 of 20. At 107, large and small models alike failed completely. The decline is not gradual. It is a cliff.

The responses came from every layer of the stack within a few months of each other. Server authors went first: GitHub consolidated its fragmented project-management tools into three in January 2026 and cut around 23,000 tokens, half that toolset's footprint. The server also ships a --read-only flag and a GITHUB_TOOLSETS environment variable so clients load only the categories they use.

Anthropic attacked from the client side twice. The code execution approach presents MCP servers as code APIs on a filesystem, and the agent imports only what it needs. The engineering post puts the saving for a representative workflow at 150,000 tokens down to 2,000, a 98.7 percent reduction. The Tool Search Tool, shipped on the Claude API in November 2025, goes after the listing itself: tool definitions stay out of context entirely until searched for, then load three to five at a time. Anthropic reports an 85 percent token reduction, with accuracy on MCP evaluations climbing from 49 percent to 74 percent on Opus 4. A second mechanism in the same release, programmatic tool calling, keeps intermediate results out of the window by letting the model orchestrate tools from code. Average usage on complex research tasks dropped from 43,588 tokens to 27,297.
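
The deferred-listing idea can be shown with a toy registry. This is not Anthropic's Tool Search Tool, whose internals the post does not describe here: the class name, the keyword-overlap scoring, and the sample tools are all assumptions chosen to keep the sketch short. The point is only that full schemas stay outside the context until a search returns a handful of them.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    schema: dict  # full JSON Schema, kept out of context until selected


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric words; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower().replace("_", " ")))


class DeferredToolRegistry:
    """Toy stand-in for tool search: rank tools by keyword overlap."""

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = list(tools)

    def search(self, query: str, limit: int = 5) -> list[Tool]:
        terms = _words(query)
        scored = []
        for tool in self._tools:
            score = len(terms & _words(f"{tool.name} {tool.description}"))
            if score:
                scored.append((score, tool))
        # Stable sort: ties keep their registration order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tool for _, tool in scored[:limit]]


# Hypothetical tools for illustration only.
REGISTRY = DeferredToolRegistry([
    Tool("create_issue", "Create a new issue in a repository.", {"type": "object"}),
    Tool("list_pull_requests", "List pull requests for a repository.", {"type": "object"}),
    Tool("get_file_contents", "Read a file from a repository.", {"type": "object"}),
    Tool("search_code", "Search code across repositories.", {"type": "object"}),
])
```

Searching for "create issue" returns only the create_issue definition, so one schema enters context instead of four. The dependency is visible too: a tool whose description shares no words with the query never surfaces.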

The spec is moving too. The 2026-07-28 release candidate, locked in May, adds ttlMs and cacheScope fields to tool listings so clients can cache them, and moves the protocol core to stateless operation. The largest revision since MCP launched is substantially about cost.

Three formats, one design

Put the three side by side and the convergence is hard to miss. Skills load a 100-token pointer and defer the body. Tool search loads nothing and defers the schema. AGENTS.md's nearest-file rule scopes instructions to the directory being edited. Different layers, same move: the context window is treated as a scarce cache rather than a dumping ground.

If the pattern looks familiar, it is the memory hierarchy from computer architecture wearing new clothes. The window is L1: tiny, expensive, always hot. The filesystem is L2: skill bodies, reference docs, deferred schemas. Execution is the disk, arbitrarily large and paid for only on access. Skills formalize the hierarchy almost literally, and tool search retrofits it onto a protocol that launched without one.

Soria Parra has been making the argument explicitly: "The idea behind progressive discovery is not to take all the 20, 50, 100 tools from an MCP server and naively dump them into the context window, but to use a more modern mechanism like tool search to load tools only when they're needed." The blunt version, from the same interview, is that people "continuously complain about context bloat in MCP and end up blaming MCP for it" when the mechanisms to avoid it already exist.

The accuracy numbers make this more than a cost story. Deferred loading did not trade quality for price. On Anthropic's MCP evaluations it improved both at once, and Speakeasy's cliff shows why: somewhere between 20 and 107 tool definitions, models stop being able to pick the right one at all. Less context is not a compromise. Within limits, it is the optimization.

The trade-offs the launch posts skip

Deferred loading has a failure mode of its own: the agent has to find what it no longer sees. Skills trigger off the description field, which makes those 1,024 characters the de facto API. MCP tool search has the same dependency, and the corpus is in bad shape. A February 2026 analysis of public MCP tool descriptions found 97.1 percent contained at least one quality issue, with 56 percent failing to state the tool's purpose clearly. A deferred tool with a vague description is functionally invisible.

Skills carry a sharper risk because they execute. Anthropic's own docs warn that "a malicious Skill can direct Claude to invoke tools or execute code in ways that don't match the Skill's stated purpose." That is not theoretical. A large-scale study with data collected in January 2026 analyzed 98,380 published skills and confirmed 157 malicious ones carrying 632 vulnerabilities between them, with nearly three quarters hiding behavior their descriptions never mention. The ecosystem treats SKILL.md files like config. They are dependencies, and they deserve dependency-grade review.

And the instruction layer's gains are real but small. A 4 percent lift for a cost increase of up to 19 percent is a defensible trade on expensive tasks and a bad one on cheap ones. The same study's clearest finding was negative: generating the file with the model that consumes it makes results worse. The layer everyone adopts first is the one with the weakest measured payoff.
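
The expensive-versus-cheap distinction can be made concrete with a back-of-envelope break-even. The sketch below rests on assumptions the study summary does not settle: that the 4 percent is an absolute gain in success probability, that the full 19 percent cost increase applies, and that a successful task has a value expressed in the same units as inference cost. The function names are hypothetical.

```python
def break_even_ratio(success_lift: float = 0.04,
                     cost_increase: float = 0.19) -> float:
    """Value-to-cost ratio above which the context file pays for itself.

    Derived from: success_lift * task_value > cost_increase * base_cost.
    """
    return cost_increase / success_lift


def context_file_pays_off(task_value: float,
                          base_cost: float,
                          success_lift: float = 0.04,
                          cost_increase: float = 0.19) -> bool:
    """True when the expected extra value exceeds the extra inference cost."""
    extra_value = success_lift * task_value
    extra_cost = cost_increase * base_cost
    return extra_value > extra_cost
```

Under those assumptions the break-even sits at a value-to-cost ratio of about 4.75. A task worth ten times its inference cost clears it. A task worth twice its cost does not.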

What goes where

The working split, as of mid-2026:

Layer	Question it answers	Loading model	Idle cost
AGENTS.md / CLAUDE.md	How does this repo work?	Full file, every session	Every token, every turn
Agent Skills	How do I do this procedure?	Metadata always, body on trigger	About 100 tokens per skill
MCP	What systems can I touch?	Schemas up front, or deferred via tool search	Tens of thousands of tokens, or near zero when deferred

Keep the instruction file under a screen of text and write it by hand. Non-inferable facts only: the deploy command's surprising flag, which test suite gates CI, where the internal docs live.

Move anything procedural into skills. A repeatable workflow with steps, scripts, and reference material is exactly the shape progressive disclosure was built for. For a worked example of skills and subagents carrying a real workflow end to end, see the 4-agent Claude Code team guide on this site.

Audit MCP before adding to it. Run the server, dump the first request, count the schema tokens. Prefer servers with consolidated toolsets, set GITHUB_TOOLSETS or its equivalent, and turn on read-only mode where you can. If you assemble your own agent loop rather than adopting a vendor harness, check whether your provider and SDK support deferred tool loading at all. Support varies, and the AI SDK provider tracker is a reasonable place to start.
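
A first-pass audit does not need a real tokenizer. The sketch below takes the list of tool definitions from a server's tool listing, serializes each one, and estimates tokens at roughly four characters apiece. That ratio is a crude heuristic and an assumption of this example, so treat the output as an order of magnitude and use your model's own tokenizer for real numbers. The function names are hypothetical.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; not a substitute for a real tokenizer."""
    return round(len(text) / chars_per_token)


def audit_tools(tools: list[dict]) -> list[tuple[str, int]]:
    """Return (tool name, estimated tokens) pairs, most expensive first."""
    rows = [
        (tool["name"], estimate_tokens(json.dumps(tool, separators=(",", ":"))))
        for tool in tools
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def share_of_window(total_tokens: int, window: int = 200_000) -> float:
    """Fraction of the context window consumed before any work starts."""
    return total_tokens / window
```

For scale, share_of_window(55_000) comes out at 0.275, the 27.5 percent of a 200,000-token window this article opened with. Sorting by estimated size shows which tools to cut or consolidate first.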

The budget outlasts the formats

Governance consolidated faster than anyone expected. MCP and AGENTS.md now live under the same foundation, and the skills standard's adopter list reads like a who's who of the same vendors. The formats may yet collapse into each other. Once every harness implements progressive disclosure, an instruction file is just a skill that always triggers, and a tool schema is just a resource the agent looks up. The layering that feels load-bearing today is partly an artifact of which company shipped which file format first.

Soria Parra's framing for the year ahead is the right one: "2025 was about figuring out whether something like MCP is needed in the ecosystem, and the answer is a resounding yes. But 2026 will be about making sure it's ready to help people productionize agentic systems." Productionizing means budgets. The context window got a price tag in 2025 and a set of standards optimizing it in 2026. Whatever formats survive, the 55,000-token preamble is not coming back.
