Home › Blog › The Token Bill Arrives: Discovering lean-ctx

Who Owns the Context Window?

Part 2 of 8
  1. Part 1 Part 1 Title
  2. Part 2 Part 2 Title
  3. Part 3 Part 3 Title
  4. Part 4 Part 4 Title
  5. Part 5 Part 5 Title
  6. Part 6 Part 6 Title
  7. Part 7 Part 7 Title
  8. Part 8 Part 8 Title
Boni Gopalan June 6, 2026 9 min read AI

The Token Bill Arrives: Discovering lean-ctx

AIAgentsContext EngineeringDeveloper ToolsToken OptimizationSDLC
The Token Bill Arrives: Discovering lean-ctx

See Also

ℹ️
Article

Dynamic Workflows: The Confused Deputy Behind Every agent()

If every side effect in a dynamic workflow is routed through an agent(), then the security of the workflow is the security of agent() — a subagent that can do anything your session permits. A follow-up that probes that surface empirically: the JS lockdown isn't a boundary, the engine adds control-model risk rather than new privilege, and two probabilistic guardrails — model injection resistance and the harness action classifier — both fired but neither is a wall.

AI
Article

Dynamic Workflows: A Deterministic Controller Over LLM Subagents

Claude Code's Workflow tool runs a deterministic JavaScript controller that spawns and awaits LLM subagents. This is the execution model and the sandbox contract — what the script can and can't do, verified by probing the isolate, not just reading the docs: no require, no process, no fetch, a dual-layer determinism guard, and every side effect routed through an agent().

AI
Article

Pi-Ralph Is Smaller Than It Looks and Bigger Than It Acts

Entelligentsia's pi-ralph extension fits a Generator-Critique-Judge agent loop in ~900 lines. A balanced look at what the pi harness hands you for free, what the scaffold latently affords — multi-model role assignment and dynamic tool synthesis via bash — and what production users still have to build.

AI

The Token Bill Arrives: Discovering lean-ctx

The first time I actually decomposed a month's token bill, I did it the way you'd audit any invoice: sort the line items, find the whale. I expected the whale to be intelligence — frontier-model reasoning applied to hard problems, the thing I thought I was paying for. It wasn't. The whale was transit. My agents were spending most of my money re-reading things they had already read, re-sending things they had already sent, and narrating build logs to a model that did not need them narrated.

Part 1 ended with three pains — the bill, the rot, the blindfold — and a promise that each one drove a stage of this journey. This is the bill's part. It's also the part where I do what almost everyone running agents at scale did somewhere in 2025: I went looking for a product to fix it, and found a whole market waiting for me.

What an SDLC pipeline actually reads

To see why the bill compounds, you have to see what a governed pipeline reads per turn. A Forge agent working a single task is not a chatbot with a long memory. It is eight consecutive specialists — planner, reviewer, implementer, reviewer again, validator, approver, archivist, committer — and each one wakes up, loads its slice of project state, and works through tools:

  • Store queries. Task records, sprint context, dependency lookups — JSON, every field, every turn it's needed.
  • File reads. Source files, personas, templates, knowledge-base docs. Read a 200-line file to change four lines; the other 196 ride along.
  • Tool output. npm test passes and says so in 4,000 tokens. tsc fails and says so in 12,000. A store query returns an entity graph with a traversal trace nobody asked for.

None of this is waste in the criminal sense. It's the ordinary metabolism of software work. But an agent loop has a property no human workflow has: every turn re-pays for the accumulated context. The model is stateless; the conversation isn't. That 12,000-token compiler dump from turn six is still in the window at turn twenty, billed again at every turn between. When I instrumented forge-cli runs, a single run-task was pushing two to three million input tokens through the meter across 180–240 model turns — for a task whose actual diff might be four lines. Here is one real run, phase by phase, from the harness's own transcripts:

{
  "figure": "FIG. 1",
  "heading": "ONE TASK, EIGHT PHASES",
  "title": "billed input tokens per phase — real run, four-line fix",
  "items": [
    { "label": "plan", "value": 471364, "display": "471.4K" },
    { "label": "review-plan", "value": 191802, "display": "191.8K" },
    { "label": "implement", "value": 280585, "display": "280.6K" },
    { "label": "review-code", "value": 305218, "display": "305.2K" },
    { "label": "validate", "value": 330824, "display": "330.8K" },
    { "label": "approve", "value": 126931, "display": "126.9K" },
    { "label": "writeback", "value": 224838, "display": "224.8K" },
    { "label": "commit", "value": 344743, "display": "344.7K" }
  ],
  "total": "Σ <strong>2,276,305 input tokens</strong> · 180 turns · 8.5 min of model time — one task, one four-line diff"
}

A sprint is ten of those. The bill doesn't grow with the difficulty of your problems. It grows with the plumbing.

I want to be precise about the emotional sequence here, because I think it's the same one that created this entire product category. First you see the number and assume you're doing something wrong. Then you decompose it and discover the dominant line item is boilerplate in transit — and you feel something close to indignation. I'm paying frontier-model prices to ship unread test output back and forth. And then, because you are an engineer, indignation becomes a requirements document: something should sit between my agent and the model, and squeeze.

The market that was waiting

It turns out that requirements document had already been written, funded, and shipped — several times. By 2025 there was a recognizable class of tool occupying one specific architectural slot: a model-agnostic layer at the agent↔environment boundary that compresses, prunes, or restructures what flows toward the LLM. The pitch, across every product in the category, is the same sentence: most of what your agent feeds the model is waste, and we remove it before you pay for it.

This isn't a fringe claim. The academic version — SWE-Pruner — describes itself as "lightweight middleware at the agent-environment boundary to prune the environment's observation," reports 23–54% token reduction on SWE-Bench-class tasks, and attaches to existing agents without platform cooperation. The slot is real. The demand is real. I was the demand.

Three products define the category for me, and what makes them interesting as a set is that they don't differ on the pitch — they differ on where they put the scalpel:

{
  "figure": "FIG. 2",
  "title": "where each product puts the scalpel",
  "source": "tool calls · shell · reads",
  "sink": "the only meter that debits",
  "cardTitle": "the gauntlet",
  "cardBody": "Tokens flow left to right; brighter, smaller glyphs have been compressed. Click a gate to see what that layer can — and cannot — intercept. Toggle to a governed harness and watch the violet store traffic leave the gauntlet entirely.",
  "bypassLabel": "governed traffic — artifact handoffs · store tools — invisible to middleware",
  "layers": [
    {
      "product": "lean-ctx",
      "name": "tool layer",
      "sees": "the calls the model chooses to route through ctx_* tools — adoption is voluntary, every turn",
      "misses": "every native read/bash call the model makes out of habit; all governed store traffic",
      "toll": "instructions injected into the agent's context — tokens, attention, obedience"
    },
    {
      "product": "rtk",
      "name": "command layer",
      "sees": "shell commands matching its built-in recipe book (git, cat, ls, npm test…) — automatically",
      "misses": "commands without recipes — project tool scripts pass through untouched; file reads; store traffic",
      "toll": "recipe coverage — the book is compiled in, and your stack may not be in it"
    },
    {
      "product": "headroom",
      "name": "wire layer",
      "sees": "every byte of every request — tool results, prompts, even harness-internal traffic",
      "misses": "nothing in transit — but it must guess from outside what is safe to crush",
      "toll": "a proxy hop of latency on every single request"
    }
  ]
}

lean-ctx works at the tool layer. It hands your agent a parallel toolset — ctx_read instead of read, ctx_shell instead of bash, ctx_grep instead of grep — each returning a compressed, cached, mode-aware version of what the native tool would have returned. Ten read modes: full file, dependency map, signatures only, diff since last read. Its headline promise is the cached re-read: read a file once, and reading it again costs ~13 tokens — a receipt instead of a reprint.

rtk works at the command layer. It doesn't ask the model to do anything differently. A hook intercepts every shell command on its way to execution and rewrites the ones it has recipes for — git status becomes rtk git status, which produces the same information in a fraction of the tokens. The model never knows. Invisible, automatic, recipe-bound.

headroom works at the wire. It's an HTTP proxy: point your provider URL at it, and every request — the whole payload, every tool result, everything — passes through its compressors before crossing the network. It sees the most of any of them, because it sees what the model would see.

Tool, command, wire. Persuasion, recipes, plumbing. Hold that gradient — the whole series eventually hangs off it.

Discovering lean-ctx

I found lean-ctx the way you find these things: late, tired, mid-bill-indignation, scrolling. And I want to write about what adoption actually felt like, because the felt experience is data too, and it would be dishonest to skip it.

lean-ctx is seductive. I don't mean that pejoratively — I mean it is a genuinely well-crafted developer experience aimed precisely at the wound. You install it, it injects its rules into your agent's instructions, and suddenly your tools have modes. Reading a file for orientation? signatures mode gives you the API surface for a tenth of the tokens. Re-reading after an edit? diff mode. Big log dump? The shell wrapper knows ninety-five compression patterns. And floating above all of it: a dashboard. Tokens saved. Compression ratio. Dollars. A little pixel-art companion that evolves as your savings grow — mine is named Keen Orbit, and I am not too proud to tell you that I have checked on it.

The machine I'm typing this on runs lean-ctx in my Claude Code config right now. Daily driver, months deep. This is its actual meter, captured while writing this paragraph:

@caption: lean-ctx gain — my machine, eleven days into the current streak
  ╭──────────────────────────────────────────────────────────────╮
  │                                                              │
  │    ◆  lean-ctx   Token Savings Dashboard                     │
  │                                                              │
  ├──────────────────────────────────────────────────────────────┤
  │                                                              │
  │    7.7M          59.1%         6,899         $23.48          │
  │    tokens saved  compression   commands      USD saved       │
  │                                                              │
  ╰──────────────────────────────────────────────────────────────╯

         ()    ()       Keen Orbit | Egg | Rare | Lv.29
       (--------)       Mood: Happy | XP: 42.2K
      (  :o  o:  )      "4-day streak! Keep going!"
      (   ===    )
       (--------)
        | :::: |
       /\      /\

  Cost Breakdown  @ $2.50/M input · $10.00/M output
  ──────────────────────────────────────────────────────────────

    Without lean-ctx       $45.11   $32.69 input + $12.42 output
    With lean-ctx          $21.64   $13.36 input + $8.28 output

    You saved              $23.48   input $19.34 + output $4.14

  Savings by Source
  ──────────────────────────────────────────────────────────────

    MCP Tools       5682x  ██████████████▊    6.3M   68.7% rate ·  81.7% of total
    Shell Hooks     1217x  ███▍               1.4M   36.5% rate ·  18.3% of total

    Since 2026-05-22 · 11 days   ▁█▁▁▅▃▁▁▂▂▁

In interactive use — me, a chat, a codebase, human-paced questions — it feels unambiguously good. Files come back tighter. Re-reads feel free. The dashboard ticks upward and tells you that you are a thrifty and virtuous engineer. Seven point seven million tokens, says the meter. Twenty-three dollars and forty-eight cents. Hold those numbers gently — not because they're false, but because of whose meter they're on. That distinction is about to do a lot of work.

So the natural move was: put it under Forge. Let the eight-phase pipeline — the thing actually generating the bill — run on the compressed toolset. Multiply the dashboard's percentages by a sprint's two-hundred-turn appetite and watch the line item shrink.

And it was somewhere around there — between the install and the invoice — that three questions started nagging, the kind a paying customer eventually can't un-ask.

The three questions

{
  "items": [
    {
      "q": "Whose meter do I trust?",
      "body": "The vendor's benchmark page says one number — sixty, eighty, ninety percent. The product's own dashboard says another — <em>its</em> accounting of <em>its</em> savings, on the traffic <em>it</em> touched. And the provider's bill says a third — the only number that actually debits anything. These three meters describe different things measured at different layers, and nothing anywhere guarantees they agree. Nearly everything written about this category cites the first two meters. <strong>Nobody cites the bill.</strong>",
      "tag": "→ put to the bench in part 3 · the meter itself goes on trial in part 6"
    },
    {
      "q": "How much of my bill can a middleware even see?",
      "body": "Forge is not a naive harness. Phases are context-isolated; agents hand off through reviewed artifacts; store traffic moves through Forge's own compact tools. By the time middleware gets to look, the harness has already governed away most of what middleware exists to fix. What's left is a <em>slice</em>, not the bill. A product can compress its slice by ninety percent, claim ninety percent in good faith on its own meter, and move the invoice by almost nothing. <strong>The ceiling isn't the compression ratio. It's the addressable surface</strong> — and no product's marketing tells you how big your surface is.",
      "tag": "→ measured in part 3 · the same ceiling humbles my own library in part 4"
    },
    {
      "q": "What does engagement cost?",
      "body": "Each scalpel position pays a different toll. A tool-layer product only works if the model actually <em>chooses</em> its tools, every turn, against the gravitational pull of the natives it was trained on — so it spends instructions, and instructions are tokens, attention, obedience. A command-layer product engages automatically, but only on commands it has recipes for. A wire proxy engages on everything — and adds a hop of latency to every request. <strong>Nothing in this category is free.</strong> The question is whether what it buys exceeds what it spends, on your harness, at your meter.",
      "tag": "→ each layer pays its toll on camera in part 3"
    }
  ]
}

The ceiling, and what's pressing down on it

There's a structural version of question two that deserves its own paragraph, because it's the quiet thesis of this whole series.

Generic compression — middleware compression, harness-blind by design — can only ever optimize the traffic that reaches it. But the two layers it sits between are both moving. Above it, harnesses are getting opinionated: phase isolation, artifact handoffs, governed state tools — context management as architecture, which shrinks the compressible slice before middleware sees a byte. Below it, the platforms are absorbing the same functions server-side: tool-search primitives that cut definition bloat, programmatic tool calling that keeps output out of context entirely, server-side compaction, context editing, and — the deepest of all — cache pricing that makes re-sent tokens nearly free at the provider's own layer, where no proxy can follow.

Remember Part 1's distinction: absorption above the line hands you a hook; absorption below the line hands you a curtain. The middleware category lives in the narrowing gap between an architectural layer rising from above and an economic layer rising from below. Whether the gap is still wide enough to hold a product category — that's not a question you can answer from anybody's marketing page. From either direction.

Only one way to settle it

Here's what I had at this point: a bill I understood, a market that said it could cut it, a product I genuinely liked using, and three questions none of the available meters could answer. The engineering disposition allows exactly one move from that square.

Measure it. Properly, or not at all.

Same real task, same harness, same models, same golden-reset starting state, one neutral meter — the provider's bill, as reported in the harness's own transcripts, request by request, phase by phase. All three products, each at its own layer: lean-ctx at the tools, rtk at the commands, headroom on the wire. A pre-registered protocol, frozen in git before the runs, so the analysis couldn't quietly bend toward a conclusion. And because benchmarking other people's products carries an obligation of fairness: a review-request issue filed with each maintainer, linking the full setup, asking whether the integration is faithful, and promising their responses verbatim in the writeup.

I expected the bench to referee a product comparison. It did rather more than that. One vendor shipped a fix within hours. One product turned out to be working flawlessly on a slice too small to matter. And the meters — all three of them — told stories so different that the differences became the finding.

That's Part 3, on June 17. The bench is public; bring your own skepticism.


This is Part 2 of "Who Owns the Context Window?" — a series on where context management should live, told through the building of one system that had to answer it. The benchmark, its protocol, and every transcript are public at github.com/Entelligentsia/tokbench. Forge is open source: the Claude Code plugin and forge-cli, which runs on the pi coding agent.

More Articles

Dynamic Workflows: The Confused Deputy Behind Every agent()

Dynamic Workflows: The Confused Deputy Behind Every agent()

If every side effect in a dynamic workflow is routed through an agent(), then the security of the workflow is the security of agent() — a subagent that can do anything your session permits. A follow-up that probes that surface empirically: the JS lockdown isn't a boundary, the engine adds control-model risk rather than new privilege, and two probabilistic guardrails — model injection resistance and the harness action classifier — both fired but neither is a wall.

Boni Gopalan 11 min read
Dynamic Workflows: A Deterministic Controller Over LLM Subagents

Dynamic Workflows: A Deterministic Controller Over LLM Subagents

Claude Code's Workflow tool runs a deterministic JavaScript controller that spawns and awaits LLM subagents. This is the execution model and the sandbox contract — what the script can and can't do, verified by probing the isolate, not just reading the docs: no require, no process, no fetch, a dual-layer determinism guard, and every side effect routed through an agent().

Boni Gopalan 13 min read
Pi-Ralph Is Smaller Than It Looks and Bigger Than It Acts

Pi-Ralph Is Smaller Than It Looks and Bigger Than It Acts

Entelligentsia's pi-ralph extension fits a Generator-Critique-Judge agent loop in ~900 lines. A balanced look at what the pi harness hands you for free, what the scaffold latently affords — multi-model role assignment and dynamic tool synthesis via bash — and what production users still have to build.

Boni Gopalan 8 min read
Previous Part 1 Title Next Part 3 Title

About Boni Gopalan

Elite software architect specializing in AI systems, emotional intelligence, and scalable cloud architectures. Founder of Entelligentsia.

Entelligentsia Entelligentsia