Who Owns the Context Window?

Part 3 of 6

Boni Gopalan · June 17, 2026 · 14 min read · AI

The Three Meters Never Agree: Putting lean-ctx, rtk, and headroom on the Bench

AI Agents · Context Engineering · Developer Tools · Token Optimization · SDLC

We build and run Forge, an automated coding agent, and every month we pay its token bill. So when a tool shows up promising to cut that bill, I want proof before I bolt it onto anything. The whole category of "context-management" add-ons sells the same pitch: plug one in and your agent burns fewer tokens, so you pay less. Fine. I wanted that checked against what the provider actually charges us, not against the numbers the tools print on their own dashboards.

So I kept Forge fixed as the harness in every run and swapped three of the best-known tools in the category through it, each one doing something genuinely different. rtk rewrites shell commands to be terse. headroom compresses the traffic on the wire. lean-ctx runs a context store and a cache. Each ran inside the same Forge pipeline, on the same real coding task, measured against Forge running with nothing added. Fourteen runs in the frozen matrix, token counts pulled straight from the provider's bill, under a protocol I committed to git before the first run. About 76 million tokens of our own money, all told.

Here's what came back. Forge's plain baseline already swung wildly from run to run, anywhere from 1.9 to 2.7 million tokens, and that random variation was about ten times bigger than any tool's effect. Against that floor, none of the three tools beat Forge with nothing added. This isn't a hit piece, and it's not a verdict on the products. All three do exactly what they were built to do, and the reason none of them helped here is almost entirely about the harness and almost nothing about their quality. Two of the maintainers shipped fixes because of this benchmark, and one merged a change I sent them. The real finding turned out to be that the three meters never agreed with each other: the vendor's, the dashboard's, and the bill's.

The setup

The question wasn't academic; it was operational: do we actually need an external token manager bolted onto Forge? The vendor dashboards all say yes. I wanted the provider's bill to either confirm that or kill it.

That premise is also the design, and I'd rather state it plainly than bury it. Forge is our own product, and it's the fixed element in every run, held constant and swapped against nothing. I don't think that's a conflict to apologize for; it's what makes the comparison clean. The task doesn't change, the pipeline doesn't change, the model tier doesn't change. The only thing moving between runs is the tool under test. One instrument, applied the same way to the native baseline and to all three tools, so the bill can do the talking.

So: the same real task every time, a multi-step code change run through Forge's full pipeline — plan, review-plan, implement, review-code, validate, writeback, commit. Same models, same golden-reset starting state. And one meter for all of it: the provider's billed input tokens, pulled straight from the harness transcripts request by request, phase by phase. Not the vendor's benchmark page, not the product's dashboard. What we were actually charged.

Three tools, each picked because it cuts at a different layer, and each one tested against a review-request issue I filed with its maintainer first:

rtk rewrites shell commands to be terse — it acts on command output (issue #2292).
headroom compresses the request traffic on the wire — it acts on every request (issue #645).
lean-ctx manages a context store, a cache, and a set of injected usage rules — it acts on reads and context (issue #361).

Five native baseline runs (call them A0), three runs per tool. And because I was benchmarking other people's work in public, every maintainer got that issue with the full setup before this was published, along with a standing invitation to tell me I'd wired it up wrong and a promise to quote them word for word. More on what they said later.

One scope note before the numbers, because it decides whether any of this is even about you. We weren't testing these tools the way most people use them — a developer chatting with a codebase through something like Claude Code, long-lived conversation, warm cache, which is exactly the setting these tools are built for. There, any of the three can genuinely help. We were testing the other thing: agentic, long-running software engineering, an automated pipeline that spins up a fresh context for each phase and grinds a task end to end. I think that's where this is all heading, which is why the test matters, and also why you can't take these results and apply them straight to the interactive case.

The noise floor

The most important number in the whole study isn't a result at all. It's the noise.

Before you can say a tool moved the bill, you have to know how much the bill moves on its own — same task, same harness, nothing added, just run it again. So I ran native Forge five times. Here's what identical inputs gave me:

{
  "figure": "FIG. 1",
  "heading": "FIVE IDENTICAL RUNS, NOTHING ADDED",
  "title": "billed input tokens per native (A0) run — same task, same harness, same models",
  "items": [
    { "label": "R14", "value": 1894592, "display": "1.89M" },
    { "label": "R13", "value": 2122389, "display": "2.12M" },
    { "label": "R01", "value": 2183153, "display": "2.18M  ← median" },
    { "label": "R09", "value": 2693419, "display": "2.69M" },
    { "label": "R05", "value": 2705356, "display": "2.71M" }
  ],
  "total": "A0 noise band <strong>1,894,592 – 2,705,356</strong> · median 2,183,153 · a <strong>±19%</strong> spread with nothing changed but the dice"
}

Nothing changed between those runs except the path the model took through the work. The spread is 810,000 tokens, about 37% of the median, and almost all of it comes from one place: the commit phase, which went from 23K tokens on a quiet run to 927K on a chatty one. An agentic loop is a stochastic process. The same task might take 159 turns or 230. Where the model decides to be thorough and where it decides to be quick moves the bill more than any tool I was about to test.
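
The band, median, and spread quoted above can be re-derived from the five FIG. 1 values alone. This is a minimal sketch of that arithmetic, not the analysis code from the tokbench repo; the variable names are my own.

```python
from statistics import median

# Billed input tokens for the five native (A0) runs, copied from FIG. 1.
a0_runs = {
    "R14": 1_894_592,
    "R13": 2_122_389,
    "R01": 2_183_153,
    "R09": 2_693_419,
    "R05": 2_705_356,
}

values = sorted(a0_runs.values())
band_floor, band_ceiling = values[0], values[-1]
a0_median = median(values)
spread = band_ceiling - band_floor
spread_pct = 100 * spread / a0_median

print(f"band:   {band_floor:,} to {band_ceiling:,}")
print(f"median: {a0_median:,}")
print(f"spread: {spread:,} tokens, {spread_pct:.1f}% of the median")
print(f"as a symmetric band: about ±{spread_pct / 2:.0f}%")
```

The 37% in the text and the ±19% in the figure are the same spread, stated once as a full width and once as a half width.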

That's the bar, and it's brutal. Any honest claim about a tool's effect has to clear a band that's already ~810K tokens wide. Which makes this a pre-registered pilot — N=3 per tool, one task, one pipeline, one model tier — not a definitive benchmark. The rule I froze before the runs: a tool's median has to land outside the A0 band to count as a real effect. Inside the band, I can't tell it apart from doing nothing.

The result

None of the three cleared it.

{
  "figure": "FIG. 2",
  "heading": "PER-ARM MEDIAN vs THE NATIVE BAND",
  "title": "billed input tokens, median of 3 runs per arm — lower is better, native = 2.18M",
  "items": [
    { "label": "A0 native", "value": 2183153, "display": "2.18M  (band floor 1.89M)" },
    { "label": "a2 headroom", "value": 2571467, "display": "2.57M  +17.8%  · inside band" },
    { "label": "a3 rtk", "value": 2839459, "display": "2.84M  +30.1%  · above band" },
    { "label": "a1m lean-ctx", "value": 3015055, "display": "3.02M  +38.1%  · above band" }
  ],
  "total": "<strong>No arm achieves a net token reduction that clears the noise floor.</strong> Best case (headroom) merely lands <em>inside</em> the band — no better than native — despite genuine ~5–8% wire compression"
}

The easy reading of that chart is "the tools are worse than nothing," and it's wrong. What's actually going on is narrower and more interesting: on this harness the best of them is indistinguishable from running nothing, and the other two add tokens. But the reasons are three completely different stories, and none of them is "the product is bad." The intuition that a tool shaving tokens off every turn has to save money overall just doesn't survive contact with an agentic loop. Here's why, one tool at a time.
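
Before the per-tool stories, the frozen rule can be applied mechanically to the FIG. 2 medians. The sketch below is one way to write that check, assuming the band is simply the minimum and maximum of the five A0 runs; the function name and labels are illustrative.

```python
# A0 noise band and median, from FIG. 1.
A0_FLOOR = 1_894_592
A0_CEILING = 2_705_356
A0_MEDIAN = 2_183_153

# Per-arm medians, from FIG. 2.
arm_medians = {
    "a2 headroom": 2_571_467,
    "a3 rtk": 2_839_459,
    "a1m lean-ctx": 3_015_055,
}


def classify(arm_median):
    """Place an arm's median relative to the A0 band."""
    if arm_median < A0_FLOOR:
        return "below band"
    if arm_median > A0_CEILING:
        return "above band"
    return "inside band"


results = {}
for arm, med in arm_medians.items():
    delta_pct = 100 * (med - A0_MEDIAN) / A0_MEDIAN
    results[arm] = (delta_pct, classify(med))
    print(f"{arm:13s} {med:>9,}  {delta_pct:+.1f}%  {classify(med)}")
```

Only a median below the band floor would count as a saving under that rule, and no arm lands there.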

headroom

headroom is the only tool here that genuinely removed tokens from the wire, and I can show it to the token. On run R03 its compressors stripped 238,074 tokens, 8.47% of the payload, and when I reconciled the proxy's own ledger against the provider's bill they matched exactly: 1,944,790 billed + 151,192 removed equals the counterfactual, to the token. On R10 it removed 147,809, or 5.01%. That's not a dashboard number, it's arithmetic that closes against the actual invoice.

And it still didn't net out. headroom's median came in at +17.8%, inside the band, swamped by path variance. So where did the verified 5–8% saving go? Into turns. The compressed runs were the longest in the whole matrix, 230 and 203 turns against native's ~159, at the lowest tokens-per-turn I measured anywhere. The proxy made each turn cheaper and the run took more of them. That trade is the whole story of this tool, and it gets its own section below.
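
A rough way to see how quickly extra turns eat a per-turn saving is a break-even calculation. The sketch below assumes every turn costs the same, which this pipeline's turns do not, so treat it as a lower-bound intuition rather than a model of the runs. The compression rates and turn counts come from the text; the text does not say which turn count belongs to which run, so they are not paired here.

```python
def break_even_extra_turns(baseline_turns, compression):
    """Extra turns at which a per-turn saving is fully spent.

    With a uniform cost per turn, the compressed run costs
    (1 - c) * (baseline + extra) turn-units, which equals the
    baseline when extra = baseline * c / (1 - c).
    """
    return baseline_turns * compression / (1.0 - compression)


BASELINE_TURNS = 159  # native's approximate turn count

for c in (0.0501, 0.0847):  # the two measured wire-compression rates
    extra = break_even_extra_turns(BASELINE_TURNS, c)
    print(f"compression {c:.2%}: break-even at about {extra:.1f} extra turns")

for turns in (203, 230):  # turn counts of the compressed runs
    print(f"{turns} turns is {turns - BASELINE_TURNS} more than native")
```

Under that simplification, a 5 to 8% saving is gone after roughly 8 to 15 extra turns, and the compressed runs took dozens more than that.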

One caveat I owe headroom, and it cuts in its favor: its cache-stabilization machinery never kicked in on my rail (cached=0 on every run), because I deliberately ran on a non-caching, request-metered provider to isolate the compression by itself. On a cache-priced provider, headroom has a whole second act my bench simply can't see. So I'm probably understating it here.

rtk

rtk did exactly what it says on the box. It rewrote 47 to 79 shell commands per run into terse equivalents and saved 64–76% of the tokens on the commands it touched. The mechanism is flawless. The trouble sits upstream of the mechanism: on a governed harness, shell-command output is only about 1–2.5% of total spend. Forge's bill is file reads and store traffic, and neither of those is a Bash command rtk has a recipe for. You can compress 75% of 2% and the invoice never notices.
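
The "75% of 2%" point is just a product of two fractions, and it is worth seeing how small the ceiling is across the quoted ranges. A minimal sketch, with a function name of my own choosing:

```python
def max_bill_reduction(surface_share, saving_on_surface):
    """Upper bound on whole-bill saving: share of bill touched times saving on it."""
    return surface_share * saving_on_surface


# Ranges from the text: shell output is about 1-2.5% of spend,
# and rtk saved 64-76% on the commands it rewrote.
low = max_bill_reduction(0.01, 0.64)
high = max_bill_reduction(0.025, 0.76)
quoted = max_bill_reduction(0.02, 0.75)  # the "75% of 2%" case

print(f"ceiling across the quoted ranges: {low:.2%} to {high:.2%} of the bill")
print(f"75% of 2%: {quoted:.2%} of the bill")
```

Even the top of that range is under 2% of the bill, against a native spread of about 37% of the median.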

rtk's median landed at +30.1%, above the band, and I want to be careful here because that number is not rtk's cost. rtk has no mechanism that adds tokens per turn; its lowest run sits right on native. The +30% is path variance, the same dice that gave A0 an 810K spread, landing on the bad side across three runs. rtk's maintainer made exactly this point to me, pointedly, and he was right to. It's a mismatch between the tool's surface and this architecture, not a defect. The fair verdict on rtk: it works, it just had almost nothing to work on here.

lean-ctx

lean-ctx is the tool I actually use every day and like the most, which is why its result took me the longest to accept. Its headline feature is the cached re-read: read a file once, then read it again for ~13 tokens. On my harness that feature never fired. Not once across seven straight runs.

The reason is structural, and once you see it there's no fixing it from the client side. lean-ctx's cache lives in a server process. Every Forge phase is a fresh process. So a file read in the plan phase and re-read in the implement phase is two cold reads hitting two different cache instances — the repeat happens across phases, where the cache can't see it. The ~13-token stub it would hand back is a back-reference into a conversation the new phase agent never had. A fresh context has to be sent the full payload, and the provider bills whatever gets sent. lean-ctx's cache and my harness's phase isolation are going after the same redundancy, and the harness gets there first.
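
A toy model makes the structural point concrete. This is not lean-ctx's code or API; `ReadCache`, the file path, and the 4,000-token file size are all hypothetical, and only the ~13-token stub figure comes from the text. It compares one long-lived session against a cache that starts cold in every phase.

```python
class ReadCache:
    """Toy read cache: full price on a first read, a short stub on a repeat."""

    STUB_TOKENS = 13  # the ~13-token re-read figure quoted in the text

    def __init__(self):
        self.seen = set()

    def read(self, path, full_tokens):
        """Return (tokens_sent, was_cache_hit) for one file read."""
        if path in self.seen:
            return self.STUB_TOKENS, True
        self.seen.add(path)
        return full_tokens, False


# Hypothetical workload: the same file is read once in two different phases.
READS_BY_PHASE = {"plan": ["src/app.py"], "implement": ["src/app.py"]}
FILE_TOKENS = 4_000  # made-up file size, for illustration only


def run(fresh_cache_per_phase):
    """Total tokens sent and cache hits for the workload above."""
    cache = ReadCache()
    total, hits = 0, 0
    for phase, paths in READS_BY_PHASE.items():
        if fresh_cache_per_phase:
            cache = ReadCache()  # new process per phase, so the cache starts cold
        for path in paths:
            tokens, hit = cache.read(path, FILE_TOKENS)
            total += tokens
            hits += hit
    return total, hits


long_lived = run(fresh_cache_per_phase=False)
phase_isolated = run(fresh_cache_per_phase=True)
print("long-lived session:", long_lived)
print("phase-isolated:    ", phase_isolated)
```

The repeat read only turns into a hit when the cache and the conversation both outlive the phase boundary, which is the case the long-lived run stands in for.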

Meanwhile lean-ctx pays a toll the second it loads: it injects a block of MANDATORY usage rules into every agent's context, roughly 3K tokens a turn. Over a 160-turn run that's about 480K tokens, and it's the term that dominates. Its own gain meter, captured live mid-run, tells the story:

@caption: lean-ctx gain — arm a1m, mid-matrix, 49% read adoption, cache working as designed
  ◆  lean-ctx   gain

    tokens saved        5,847
    cache hits              0
    compression       (cold reads only)
    ───────────────────────────────
    net vs bill        0 saved · $-0.001

    cep.sessions            0   (7th consecutive zero)
    ctx_read calls         38   (steering effective — 5× pilot adoption)

Steering works. Adoption is real, 38 to 64 ctx_* calls a run, the maintainer's fixes landed, the cache is genuinely functional. And the meter still reads zero saved, because on this harness there's nothing for the cache to bite on. Net result: +38%, the tool spending more than it brings back. The harness had already removed its surface before it showed up.
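
To size the injected-rules toll against the figures already on the page, here is a back-of-envelope sketch. It assumes the rules block is re-billed at a flat 3K tokens on each of 160 turns, which is the rough figure from the text rather than a measured per-run value, and it compares medians that sit in a wide noise band, so the shares are indicative only.

```python
RULES_TOKENS_PER_TURN = 3_000  # rough per-turn size of the injected rules block
TURNS = 160                    # the run length used in the text
A0_MEDIAN = 2_183_153          # native median, FIG. 1
LEAN_CTX_MEDIAN = 3_015_055    # arm a1m median, FIG. 2
METER_TOKENS_SAVED = 5_847     # "tokens saved" on the gain meter above

injected = RULES_TOKENS_PER_TURN * TURNS
excess = LEAN_CTX_MEDIAN - A0_MEDIAN

print(f"injected rules over the run: {injected:,} tokens")
print(f"as a share of the native median: {injected / A0_MEDIAN:.1%}")
print(f"as a share of the a1m excess ({excess:,}): {injected / excess:.0%}")
print(f"meter's saved tokens vs injected: {METER_TOKENS_SAVED / injected:.1%}")
```

On those assumptions the rules block alone accounts for more than half of the gap to native, and the meter's saved tokens are about one percent of it.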

Why per-turn savings don't add up

There's an obvious objection here: how can a tool that trims tokens on every single turn fail to save money? It's the right question, and headroom is the case that answers it.

In an agentic loop, trimming context has a second-order effect you can't see if you only look at one turn. When you compress what the model reads, sometimes you compress away a detail it needs later, so it asks again. That re-ask is an extra turn. And extra turns are a loan you repay with interest, because the turns in this pipeline aren't equal weight.

{
  "figure": "FIG. 3",
  "heading": "NOT ALL TURNS COST THE SAME",
  "title": "billed input tokens by phase, representative matrix run — the loop's weight is back-loaded",
  "items": [
    { "label": "writeback", "value": 74000, "display": "74K" },
    { "label": "review-plan", "value": 192000, "display": "192K" },
    { "label": "implement", "value": 222000, "display": "222K" },
    { "label": "review-code", "value": 305000, "display": "305K" },
    { "label": "plan", "value": 471000, "display": "471K" },
    { "label": "validate", "value": 561000, "display": "561K" },
    { "label": "commit", "value": 694000, "display": "694K" }
  ],
  "total": "An extra turn in <strong>commit</strong> or <strong>validate</strong> costs 10×+ an extra turn in writeback. <em>Which</em> turn inflates decides the tokenomics — and a compressor can't choose"
}

The per-turn saving gets booked in the light, early phases. The extra turns it causes land wherever the model has to re-fetch, often in the heavy late phases, where context has piled up and every turn re-pays for all of it. You save a little, early and for sure; you pay more, later and bigger. That's the loan.
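
The FIG. 3 totals can be turned into shares of the run to show where the weight sits. Note the limit of this sketch: these are phase totals, not per-turn costs, because the figure does not give turn counts per phase.

```python
# Billed input tokens by phase, from FIG. 3 (representative run).
phase_tokens = {
    "writeback": 74_000,
    "review-plan": 192_000,
    "implement": 222_000,
    "review-code": 305_000,
    "plan": 471_000,
    "validate": 561_000,
    "commit": 694_000,
}

run_total = sum(phase_tokens.values())
shares = {phase: tokens / run_total for phase, tokens in phase_tokens.items()}

for phase, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phase:12s} {phase_tokens[phase]:>8,}  {share:.1%}")

heavy_late = shares["validate"] + shares["commit"]
print(f"validate + commit: {heavy_late:.1%} of {run_total:,} tokens")
```

Two phases carry about half of the representative run, so an extra turn that lands there is drawing on the most expensive part of the bill.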

I want to be careful with this one, because it's the most tempting claim in the piece and the easiest to oversell. My evidence is consistent with the mechanism: the compression-only isolation run logged 196 turns with zero retrieve calls, so the extra turns tracked the compression rather than failed lookups. But raw phase variance is a real confound, and that 196-turns-with-zero-retrieves number is exactly the data point headroom's maintainer told me he wants to check, because it cuts against his own model of when turns should inflate. So I'll put it the way the data allows: compression appears to induce re-fetch turns. Not "caused." And this loan-with-interest story is about headroom specifically, the one real per-turn compressor here. rtk and lean-ctx fail for the structural reasons above, not this one.

What actually moved the bill

If none of the middleware moved the invoice, something else must have. Something did, and it came from a layer none of these three tools can reach.

@caption: a0c — native Forge, same task, run on an Anthropic model with prompt caching on
  run a0c-T-fix-r1   ·   4ge native   ·   cache-priced rail
  ──────────────────────────────────────────────────────────
  input-side tokens        1,672,470     (92.1% served from cache)
  fresh tokens                   261
  measured cost                 $1.82     (projected $1.81 — bill-exact)
  ──────────────────────────────────────────────────────────
  vs no-cache native        ~72% cheaper   (~$4.80/run saved)
  vs best middleware effect       ~14×      the largest saving any tool managed
  vs Claude Code, same task      3.3×       cheaper end-to-end

Turning on the provider's own prompt caching cut the bill by ~72%, roughly fourteen times the best any middleware managed, and it did it by leaving re-sent tokens nearly free at a layer below where any proxy can reach. On a pipeline like this, the biggest lever on the bill lives at the provider/caching layer and in the harness's own output discipline. That's above and below the slot these middleware tools sit in, not inside it.
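
The percentages in that transcript follow from two of its own numbers. A small sketch of the arithmetic, using the approximate figures as printed (so the outputs are approximate too):

```python
measured_cost = 1.82   # cached run, dollars, from the a0c transcript
saved_per_run = 4.80   # approximate saving per run, same transcript

no_cache_cost = measured_cost + saved_per_run
reduction = saved_per_run / no_cache_cost

# The transcript puts caching at roughly 14x the best middleware effect.
implied_best_middleware = 0.72 / 14

print(f"implied no-cache cost: ${no_cache_cost:.2f} per run")
print(f"reduction from caching: {reduction:.1%}")
print(f"implied best middleware effect: {implied_best_middleware:.1%}")
```

The implied middleware effect of about 5% lines up with the low end of headroom's measured wire compression.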

That's what the 76 million tokens really bought, more than any single tool's result: a clear read on where the savings actually live.

What I decided

The decision for Forge: we're not adopting an external token or context manager for this pipeline. There's no billing evidence it pays off, and some evidence it costs. Again, that's about fit, not about the products — the fit between these tools and a lean, phase-isolated harness.

If your setup looks like mine, here's what I'd do:

Pick a model and provider with prompt caching. That's where the order-of-magnitude saving actually lives.
Keep tool and command output lean at the harness level. A small addressable surface isn't a gap for middleware to fill; it means the job's already done.
Use these tools for what they're genuinely good at — rtk's command hygiene, headroom's cache-stabilization on a cache-priced rail, lean-ctx's store and read-mode features on a long-lived context — not as a token-savings play on a frugal harness.

Worth repeating the scope, because the cynical misreading is one careless sentence away. This is "on this harness," full stop, not "these tools don't work." rtk genuinely worked, headroom genuinely compresses, lean-ctx's cache is genuinely functional. Every one of them just walked into a harness that had already eaten its surface.

The teams behind the products

Here's the part I didn't expect when I started, and the part I'm happiest to write. Benchmarking other people's work in public, with their names on it, turned out to make the tools better, measurably, inside the run window. I owe all three teams a real thanks. They build genuinely useful tools, they gave their time to results that weren't always flattering, and this project is more correct because they did.

lean-ctx (yvgude) responded to the full results by confirming the decomposition and shipping fixes to main the same day — a rules_injection = "off" mode for hosts that bring their own workflow, a minimal tool surface, and a dashboard that now states its own denominator (issue #361). Their author-provided statement is candid and worth reading in full:

lean-ctx's savings come from cold-read compression and cached re-reads. On a long-lived context, and especially on cache-priced providers — where both the cached re-reads and the injected prefix ride the provider cache — it nets ahead. On a phase-isolated harness (fresh context per phase) on a non-caching, request-metered provider, the cached-re-read lever has no surface and the injected prefix is re-billed every turn, so it can cost tokens; its addressable share there is a few percent. devasur's decomposition is correct. In response we made the meter state its denominator and the per-turn overhead it injects, added rules_injection = off and a minimal surface for hosts that bring their own workflow, and documented the proxy as the way to reach tool output the ctx_* tools can't wrap.

That isn't "the meter lies." It's that the meter overstated the net effect, the owner agreed, and it got fixed to state what it actually measures. That thread is where Part 5 picks up.

headroom (chopratejas / JerrettDavis) asked, reasonably, for time to validate before I published, and offered to take a patch:

We would like to dig deeper into these results. we do not expect a higher number of turns. we do expect it only if headroom_retrieve is called. We are currently working on multiple fixes … so would like you to hold back on the results — until we have had a chance to validate the environment and settings. we would appreciate your patch offer if you can send us a PR.

I held the June 17 date — a frozen protocol is only frozen if the date is too — and said plainly that any fixes landing after the run window fall outside the matrix. But I sent the PR they asked for: compression-only CLI flags, defaults unchanged. They reviewed it and merged it (issue #645). One technical thread between us is still open: my isolation run showed zero retrieves but elevated turns, which runs against their model of when turns should inflate. Neither of us has closed it, and I'd rather leave it on the record as an open question than pretend it's settled.

rtk (aeppling) pushed back on the results, and the pushback was fair scrutiny the piece is better for (issue #2292). Most of what he raised either matched our own findings or was already scoped in our materials. He noted that rtk can't add tokens per turn, which is exactly our reading and not a charge against it, and he pointed at the task surface, which is our ~2.5%-addressable-surface point. His one challenge that cuts across the whole publication is the fair one to take head on: we build Forge, so can this be called independent? Our answer, posted right alongside his: Forge here is the harness, neutral infrastructure applied the same way to every run including the native baseline, not a product being ranked, and forge-compress only touches Forge's own store artifacts, which are held constant across every run, so no tool gets a relative edge. The conflict is disclosed. "Independent" here means no vendor funded, commissioned, or approved the work. The data behind all of that is public, and he's welcome to check it.

Three maintainers, three pretty different reactions, and the same outcome: outside scrutiny made open-source software better, and the tools that engaged came out stronger for it.

What this part doesn't settle

It's one harness, one provider with no cache economics on the rail, one small codebase, one model tier, one interactive operator protocol, N=3 per tool, a single task. Every claim here is scoped to those walls. The whole thing — Dockerfiles, compose files, the frozen protocol, the analysis snippets, every raw transcript — is public at github.com/Entelligentsia/tokbench, and a third party can rebuild it with a docker build and one API key. Bring your own skepticism and check my arithmetic.

Because the result raised two questions sharper than the ones I came in with:

{
  "items": [
    {
      "q": "Where does context management actually belong?",
      "body": "The bench answered an operational question — should we bolt one of these onto Forge — but it raised a market one. The biggest lever on the bill lived at the <em>provider</em> layer, which is precisely where the platforms are quietly absorbing these same functions: server-side caching, compaction, context editing. If the cheapest savings ship below the slot this whole category occupies, the honest question isn't whether any one product is good. It's <strong>where this layer should live at all</strong> — middleware, platform, or harness.",
      "tag": "→ the deep research on platform absorption is part 4"
    },
    {
      "q": "What does the bill actually show — and what does it leave out?",
      "body": "Three meters described this study and none of them agreed: the vendor's benchmark page, the product's own dashboard, and the provider's bill. lean-ctx's owner agreed to make its dashboard state its own denominator — exactly the kind of transparency that makes a meter trustworthy. The same question lives one layer down: the provider's meter is the one that debits, yet the per-request detail behind it evaporates before it reaches any dashboard you can see. <strong>That's a question worth asking carefully, not loudly.</strong>",
      "tag": "→ meter transparency is part 5"
    }
  ]
}

That's where this series stops being about products and starts being about the layer itself. Part 4 is where I stopped building and ran the research instead — 99 agents pointed at one uncomfortable question: whether the platforms are quietly absorbing this whole layer out from under the rest of us. It's up next.

This is Part 3 of "Who Owns the Context Window?" — a series on where context management should live, told through the building of one system that had to answer it. The benchmark, its frozen protocol, and every transcript are public at github.com/Entelligentsia/tokbench. Forge is open source: the Claude Code plugin and forge-cli, which runs on the pi coding agent.

About Boni Gopalan

Elite software architect specializing in AI systems, emotional intelligence, and scalable cloud architectures. Founder of Entelligentsia.