Who Owns the Context Window?

Part 4 of 6

Boni Gopalan June 19, 2026 13 min read AI

Then I Asked Whether Any of This Should Exist

AIAgentsContext EngineeringDeveloper ToolsToken OptimizationSDLC

Then I Asked Whether Any of This Should Exist

By this point I'd built the thing that decides what my agents see three separate times, and each time I'd been sure the previous version was the reason I needed the next one.

The first time, I bought it. A month's token bill came in heavier than I liked, I traced most of it to transit rather than thinking, and I did what everyone does — reached for middleware. lean-ctx, rtk, headroom, a whole little market of tools, most of them only months old, that sit between your agent and the model and shrink what crosses. That was Part 2.

The second time, I built my own, because I wasn't happy renting the behavior. forge-compress, a small library that clamps bash output and flattens the giant JSON blobs my tools hand back. The part that stuck with me was that the boring heuristics did almost all the work and the clever ideas I'd been proud of did almost none.

The third time, I built something bigger. A context governor — not "make this output smaller" but "what does this agent, in this phase, actually deserve to keep." Policy instead of compression. It was the most ambitious of the three and the one I was proudest of.

Then I benchmarked all of it against the provider's real bill, which was Part 3, and the provider's own prompt cache — which I had done nothing to earn — beat the best tool in the study by about fourteen times, for zero configuration and zero risk.

So before I built a fourth thing, I stopped. I'd spent the better part of a year being the customer for this layer, then the vendor, then the vendor's more ambitious cousin, and I had never once stepped back to ask the obvious thing: should any of this exist, or am I lovingly hand-building something the platforms are about to give away for free.

I didn't trust myself to answer it. I'd sunk too much into the layer to be objective about whether it had a future. So I did what I do when I don't trust my own conclusion, which is make a lot of other minds argue about it.

The method

I won't dress this up as a survey. It was a deep-research pass: 99 agents, 16 sources fetched and read, 78 claims pulled out of them, and then the part that actually matters — 25 of those claims sent through adversarial verification, three independent votes each, the agents told to try to kill the claim rather than confirm it. 23 survived. 2 died. After that, a handful of targeted follow-ups on the harnesses the first pass had covered thinly: Google's Antigravity, GitHub Copilot, OpenAI's Codex, and pi.

I ran it adversarially on purpose. It is very easy, when you've built something three times, to read a market and find reasons you were right to. I wanted claims that survived someone actively trying to knock them down. The two that died are the ones that make me trust the rest: one was a vendor's broad claim that their style of summarization beats the alternatives in general, and three independent checks all called it overreach — accurate about that vendor's own narrow benchmark, false as a law of nature. That's the filter working.

One caveat before any finding, because it changes how hard you should lean on them. This is a snapshot from around June 2026, and several of the most aggressive features I'm about to describe are opt-in betas, not on-by-default behavior. "The platform can do this" is not the same as "the platform does this to everyone today." Hold that gap, it carries weight later.

Six harnesses, six positions

The first thing the research did was take my private little problem and show me six other people already standing in the room, each having decided what context management is and who owns it.

Platform-absorbed ◄──────────────────────────────────────► User/middleware-owned

Google ──── Anthropic ──── Copilot ──── OpenAI ──── OpenCode ──── pi

Harness	Position, in one line
Google / Antigravity	A big window and invisible caching make optimization moot; no hooks
Anthropic / Claude Code	Turn every middleware function into a server-side API primitive
GitHub Copilot	Third parties supply inputs; the harness owns selection and compression
OpenAI / Codex	Own the primitive, let caching economics police the rest
OpenCode	A native baseline plus a real plugin API — open competition on method
pi	Don't optimize context; don't create the waste in the first place

The useful thing here isn't picking a winner. It's that there's no empty field to walk into. Every serious harness already has a context philosophy with code behind it. My governor wasn't a novel idea I was first to; it was one more entry on a spectrum that already ran from "Google does it for you and won't say how" to "pi refuses to make the mess to begin with."

Then the research got specific about the end of the spectrum I least wanted to look at.

What Anthropic absorbed

Anthropic took nearly the entire feature surface these tools now sell and shipped each piece as a first-party primitive, with the execution server-side, inside their own infrastructure, below any layer a third-party tool can reach. And most of it landed before I'd installed a single one of the tools.

The timeline is the uncomfortable part. Anthropic's pieces arrived across late 2025 and into early 2026 — context editing and a memory tool around September and October, a tool-search tool and programmatic tool calling in late November, server-side compaction that January. The middleware I reached for was the same age or newer: headroom and rtk showed up in January 2026, lean-ctx in March. I'd gone hunting for months-old tools to do a job the platform had largely built into its own infrastructure already. Line the functions up and they match almost one to one:

Tool definitions bloating your context? The Tool Search Tool loads them on demand — Anthropic's own figure is about 85% fewer tokens, roughly 77K of upfront definitions down to around 8.7K.
Tool output you don't want in context? Programmatic Tool Calling runs the calls in a code sandbox and only the result comes back; their number is an average drop from 43,588 to 27,297 tokens on complex research tasks, about 37%.
Stale tool results piling up? Context editing clears the oldest ones automatically once you cross a threshold, and the line that matters is "applied server-side before the prompt reaches Claude," with your client keeping the full history. An 84% cut in a 100-turn web-search eval.
Conversation getting long? Server-side compaction summarizes it, and the docs say the quiet part out loud: server-side compaction is the recommended strategy; use client-side only if you specifically need the control.

Every one of those is a vendor metric on a vendor workload, and I'd treat any single number with suspicion. But the direction doesn't depend on the percentages. That last line is the one that made me put my coffee down. I'd written a compression library and a governor that do client-side what the platform now recommends you let it do server-side. forge-compress clamps a tool result after my harness has already produced it; context editing clears it before the prompt is even assembled, on a machine I don't own. You can't out-engineer a position you can't physically stand in.

What's still opt-in

Here's the part the announcements bury, and if you're deciding whether to build in this space it's the most decision-relevant fact in the whole piece. None of those primitives is how the API behaves by default. Each one is gated behind a beta header you have to know exists and send on purpose, and the headers are a matter of public record:

anthropic-beta: context-management-2025-06-27 enables context editing and the memory tool — the server-side clearing of stale tool results and the file-backed offload, both on one shared header.
anthropic-beta: advanced-tool-use-2025-11-20 enables the Tool Search Tool and Programmatic Tool Calling — the on-demand tool loading and the run-the-tools-in-a-sandbox path.
anthropic-beta: compact-2026-01-12 enables server-side compaction, and even with the header set it stays idle until the conversation crosses a trigger you configure — 150,000 tokens by default, 50,000 minimum.

Don't send the header and you get none of it. The 84% and the 85% are real, measured, and switched off out of the box.

So the absorption is capability, not yet default behavior. Anthropic has built the entire middleware surface and is holding most of it behind an opt-in flag, which means the function it's absorbing keeps a market for exactly as long as those flags stay opt-in. The day context editing flips to on-by-default is the day the addressable surface for a standalone compressor closes on the Anthropic rail. I went looking for someone tracking that date publicly and couldn't find one — which tells you how few people are reading the beta headers instead of the headline numbers, and it's a fair hint about where the durable work actually is.

OpenAI and Google

Anthropic was the most aggressive but not the only one, and the other two are interesting precisely because they absorb the same function in completely different ways.

OpenAI commoditized it. Codex auto-compacts as you approach the window, and the Responses API exposes a stateless /responses/compact endpoint — you send the window, you get back a smaller one that is "the canonical next context window." The compacted items come back opaque and encrypted, and you're told plainly not to post-process them, which is its own soft lock: there's nothing in there for a middleware tool to improve. But the deeper move isn't the endpoint, it's a sentence they published about caching: "context engineering and prompt caching are inherently at odds — one prioritizes dynamism, the other stability." That's them telling the entire rewrite-style middleware category that the economics are against it, without naming a single tool. My Part 3 bill was that sentence with a dollar figure attached.

Google did the strangest thing of the three and tried to make the question disappear. Gemini 3 ships a million-token window, positioned as making optimization less necessary — just put the monorepo in. Antigravity, their agent IDE, manages context structurally instead of by compressing: it writes artifacts and knowledge items to disk and reloads them, all handled monolithically inside the platform with no pipeline hooks for anyone outside. And at the API, implicit caching is on by default — the docs literally say "there is nothing you need to do" — at roughly a 90% discount on the cached tokens. Worth noting the seam: the Gemini CLI does ship real, tunable compaction that kicks in near 50% of the window, which tells you the big-window argument doesn't fully hold even inside Google. That part came from single-agent research against their docs rather than the adversarially-voted pass, so I hold it a little looser.

Three companies, three appetites, one lunch. Anthropic rebuilds the tool as a primitive. OpenAI turns it into an endpoint and prices the alternative out. Google buries it under a window big enough that you stop noticing.

Three forces working against middleware

What turned this from an interesting competitive survey into a reason to reconsider my roadmap was realizing the absorption isn't a fad somebody reverses with a better tool. Three structural forces push the same direction.

Server-side execution is the first. Once the optimization runs inside the provider's infrastructure, before the prompt reaches the model, no client-side tool can get under it. Middleware lives above the line, and the line is moving down.

Caching economics is the second, and it's the one with teeth. Providers discount cached tokens steeply, on the order of 90%, but the cache keys on an exact, stable prefix. Rewrite-style compression mutates earlier context to shrink it, which breaks the prefix, which throws away the discount. So the harder you compress, the more cache you can destroy, and past a point you spend more than you save. That's not a theory I'm advancing, it's the Part 3 bill generalized. The cheapest token is the one you don't resend, and you get that by holding context stable, which is the opposite of what a compressor does.

Feature-surface duplication is the third. Progressive disclosure — keep tool and skill definitions out of context until they're needed — was arrived at independently by Anthropic's tool search, Google's skills, Copilot's virtual tools, and pi's lazy skills. When four harnesses that agree on nothing else all ship the same primitive, it has stopped being a product and become a commodity. You don't build a company on a commodity, you get it free with the model.

Reading it as my own roadmap

Somewhere in the middle of this I noticed I'd stopped reading a market report and started reading a triage of my own work.

forge-compress, the library I'd written, is a wasting asset. "Compress the tool output before it reaches the model" is exactly the function being absorbed, on exactly the timeline that makes building more of it a bad bet. It still works and it'll keep working. It's just no longer a place to invest, because the platforms are shipping that roadmap and undercutting it with server-side execution and cache economics I can't match.

The governor held up better under the same light, and the reason is the one genuinely useful thing the research handed me. The function being commoditized is generic compression. The thing that isn't — the thing no provider will ever do for me — is policy specific to my harness: what a planning agent in phase three of my pipeline, with my personas and my gates, is allowed to carry forward. That doesn't live at the provider, because the provider doesn't know my phases. It lives in the harness. And the harness is the one structural home on the whole spectrum where platform absorption shows up as a plugin you can adopt rather than a feature that erases you — OpenCode ships a real plugin API, pi has an extension system, and that's where merging beats extinction.

There was also a quieter irony I couldn't unsee once I'd seen it. forge-cli, the body of Forge I'd been pouring all this context engineering into, runs on pi. And pi's entire stated philosophy is the opposite of build #1: don't optimize the context, don't create the waste in the first place — a system prompt under a thousand tokens, four tools, skills that stay dormant until you call them. Mario's line is that the frontier models have all been trained up the wazoo and already know what a coding agent is, so the heavy scaffolding is just tokens you're choosing to pay for. My own foundation had been quietly arguing against the middleware I adopted the whole time I was adopting it.

Where the edges survive

It would be a tidy ending to say the whole layer is dead, but the research wouldn't support it and I don't believe it. There are edges that survive. They're just narrower than the market thinks.

Method can still differentiate, as a feature rather than a moat. Factory's summarization approach genuinely outscored Anthropic's and OpenAI's on their own benchmark — and I'll say "their own benchmark" twice, because the broader claim that their style wins in general is the one my research killed 0-3. A narrow, real, self-reported edge. Take it for exactly that.

Open harnesses with plugin surfaces are the real survivor. Middleware as a sanctioned plugin, competing on quality against a native baseline, is a structure where the platform absorbing you means adopting you. That's a fine business. It's just a different one from selling a standalone box that sits in the wire.

And the edge I find myself most drawn to: cross-provider economics and observability. No single platform has any incentive to build honest, value-per-token accounting that spans all the providers — Anthropic won't tell you when Gemini is cheaper for your workload. That's a real gap on the demand side, opening exactly as token billing makes the waste visible enough that people start asking about it.

Those edges are where the series goes from here.

{
  "items": [
    {
      "q": "Who checks the provider's meter?",
      "body": "The observability edge is the survivor I believe in most — value-per-token accounting no single platform will build across its rivals. But it rests on a question I'd been ducking: the provider's bill is the one meter that actually debits money, and the per-request detail behind it evaporates before it reaches any dashboard you can see. <strong>That's worth asking carefully, not loudly.</strong>",
      "tag": "→ the meter goes on trial in part 5"
    },
    {
      "q": "So where should the layer live?",
      "body": "If generic compression is a commodity the platforms now give away, and the durable value is harness-specific policy, then the thing worth finishing isn't the compressor — it's the governor, and it belongs in the harness. That's the conviction this whole survey pushed me toward. <strong>It's also where the series lands.</strong>",
      "tag": "→ the verdict is part 6"
    }
  ]
}

I started the year as the customer for this layer, became its vendor, and built its most ambitious version. The research didn't tell me I'd wasted the year. It told me which third of it to keep.

This is Part 4 of "Who Owns the Context Window?" — a series on where context management should live, told through the building of one system that had to answer it. The deep-research pass behind this part covered Anthropic, OpenAI, Google, GitHub Copilot, OpenCode, and pi. Forge is open source: the Claude Code plugin and forge-cli, which runs on the pi coding agent.

About Boni Gopalan

Elite software architect specializing in AI systems, emotional intelligence, and scalable cloud architectures. Founder of Entelligentsia.