How I Built a Capability Test I'd Actually Trust for a Coding Agent

How I Built a Capability Test I'd Actually Trust for a Coding Agent

See Also

ℹ️
Series (8 parts)

The Token Bill Arrives: Discovering lean-ctx

64 min total read time

The first time I decomposed a month's token bill, the whale wasn't intelligence — it was transit. My agents were re-reading, re-sending, and narrating build logs at frontier-model prices. My first instinct was the same as everyone else's in 2025: reach for middleware. A whole market was waiting.

AI
Series (8 parts)

Who Owns the Context Window? - Series Overview

64 min total read time

A builder's eight-part journey from a Claude Code plugin to a coding harness — through token bills, a public middleware benchmark, compression libraries, and a context governor — ending at a market-wide question: the platforms are absorbing context management, so where should it actually live?

AI
Series (8 parts)

Same Brain, Two Bodies: Forge as a Plugin, Forge as a Harness

64 min total read time

Last week I deleted three of the most carefully written documents of my engineering life, and the overwhelming feeling was relief. Forge lives in two bodies — a Claude Code plugin and a coding harness built on pi — and they are converging on everything except one thing: what my agents see, each turn, in each phase.

AI

How I built a capability test I'd actually trust for a coding agent

Abstract

I run an agentic coding harness, Forge, and I wanted to know whether Claude Fable 5 — newer, and twice the price — is worth routing into it, or just more expensive. Cost is easy to read off a meter; capability is the hard half, and most agent benchmarks can't measure it because the task is too easy to separate strong models. So I built one that isn't: I planted a single-token bug deep in git's version-sorting code, stripped away every way to look up the answer (no git history, no upstream to diff against), and had Forge's /fix-bug orchestration find and fix it from a symptom-only report — driven by Opus 4.8, then Fable 5, then a cheaper Sonnet 4.6, with the harness and every gate held identical.

All of them found and fixed it. On raw capability — can the pipeline localize one wrong cell out of 36 and patch it blind — it's a tie, and that itself is the result: the task is hard enough that passing means something. The differences surfaced in two places the meter doesn't advertise. Precision: Opus and Fable changed exactly the one wrong cell; Sonnet fixed it correctly but touched an adjacent cell too — harmless, as it turned out, but not surgical. Economics: the bill tracks the rate card, not the effort. Sonnet did the most work of any run — 89 turns, 59K output tokens — and was the cheapest. Fable lists at exactly double Opus per token, yet its shorter triage path nearly cancelled the premium: the run that "should" have cost 2× came in at 1.08× ($3.33 vs $3.09). Rank these runs by work and you get one order; rank them by the invoice and you get another.

{
  "figure": "FIG. 1",
  "heading": "THE RESULT, IN THREE NUMBERS",
  "title": "Fable 5 ÷ Opus 4.8 on one real agentic bug-fix — rate, work, and the bill that results",
  "items": [
    { "label": "list rate", "value": 2.00, "display": "2.00×" },
    { "label": "tokens used", "value": 0.50, "display": "0.50×" },
    { "label": "actual bill", "value": 1.08, "display": "1.08×" }
  ],
  "total": "Σ double the rate (2.00×) × half the work (0.50×) ≈ the same bill (1.08×). Fable's shorter path nearly cancels its 2× price — and on a longer task the work saving could tip it the other way."
}

A new model dropped — Claude Fable 5 — and I had the same question I have every time one does: is it actually better for my work, or just newer and more expensive? I run an agentic SDLC harness called forge-cli: it drives a bug from a symptom report through triage, implementation, review, and commit, and it lets me assign which model runs each phase. Routing a new model into that pipeline is a real decision with a real invoice attached, so I wanted evidence, not vibes. Once the rig was built it was cheap to add a third seat, so I swapped a cheaper model, Sonnet 4.6, into the driver chair too, just to see where the bottom of the ladder landed.

Cost is the easy half. You run the same task twice and read the token meters. The hard half is capability: did the more expensive model actually do anything better? Most "capability" benchmarks for coding agents quietly fail to answer this, because the task is too easy to separate the contestants. If both models trivially succeed, you haven't measured capability — you've measured overhead, and the cheaper model wins by definition.

So before I could compare any models, I had to build a task that could actually tell them apart. That turned out to be most of the work, and it's the part worth writing down. The model comparison is almost an afterthought by the end.

{
  "items": [
    {
      "q": "Is the new model actually better — or just newer and more expensive?",
      "body": "Cost you can read off a meter. Capability you have to <strong>earn the right to measure</strong> — with a task hard enough to tell two good models apart.",
      "tag": "→ the whole reason for this build"
    },
    {
      "q": "If every model passes, what did you actually measure?",
      "body": "An easy task measures <em>overhead</em>, not skill — and the cheaper model wins by default. The bug itself has to be able to <strong>separate them</strong> — and when it does, the differentiator turns out to be precision, not pass/fail.",
      "tag": "→ four conditions, next"
    },
    {
      "q": "And how do you know the fix is actually right?",
      "body": "We almost reported a regression that <em>did not exist</em> — because we trusted a re-implementation of the algorithm instead of the real binary. <strong>Verify against the artifact under test, not a model of it.</strong>",
      "tag": "→ the methodology twist, at the end"
    }
  ]
}

What makes a bug a fair, hard test

I wanted a single bug that satisfied four conditions:

  1. Hard to locate, not just hard to fix. The difficulty should be in finding the defect, because localization is where agentic pipelines actually live or die. A one-line fix you can't find is harder than a fifty-line fix sitting under a stack trace.
  2. No crash, plausible output. If the program segfaults or prints an obvious error, the symptom hands you the location. I wanted a bug that produces well-formed, confident, wrong output and exits 0.
  3. Data-dependent. The failure should only show up on some inputs, so that a careless "looks fixed to me" check can pass while the bug is still there. This punishes shallow verification.
  4. No oracle. This is the one most benchmarks skip. If the repo has its full git history, or an upstream remote, "find the bug" collapses into git log -p or git diff upstream -- the_file. The agent doesn't reason about the code; it diffs against a known-good copy. To test reasoning, the known-good copy has to be unavailable.

Real codebases are full of bugs that meet some of these. The trick is meeting all four on purpose, in a way you can verify is genuinely triggered before you spend money running models against it.

The bug I threw out first

My first candidate looked perfect and was useless, and I think the reason is the most useful thing in this whole writeup.

I planted a sign flip in linear-assignment.c — the Jonker–Volgenant solver that backs git range-diff. The defect lived in the solver's dual-variable reduction, the running adjustments that keep its cost estimates consistent as it assigns columns to rows:

int min = COST(!j1, i) - v[!j1];
for (j = 1; j < column_count; j++)
    if (j != j1 && min > COST(j, i) - v[j])
        min = COST(j, i) - v[j];
v[j1] -= min;

My flip negated an update like that last line. It was beautiful: a single character, buried in numerical-optimization code that almost nobody reads, with no crash and no obvious tell. I was pleased with myself.

Then I tested whether it actually did anything. I generated 600+ randomized range-diff scenarios and ran the buggy binary against the corrected one. The output was identical every time. Zero divergences.

The bug was real but behaviorally inert. Two things conspired: the solver's augmentation phase re-derives and self-corrects the dual variable I had corrupted, and git range-diff's cost matrix is diagonal-dominant enough that the assignment almost never changes even when the intermediate math does. I had planted a defect the program routes around.

This is a trap worth naming. A bug you cannot reliably reproduce is a bug the triage gate cannot confirm, the agent cannot test against, and the judge cannot verify. Subtlety is worthless if the symptom never fires. Validate that your planted bug changes observable behavior — deterministically — before you run anything against it. I discarded the sign flip and went looking for a defect I could prove was live.

The bug I kept: one token in versioncmp.c

versioncmp.c is git's implementation of version-aware string comparison — git's port of glibc's strverscmp. It's what makes git tag --sort=version:refname put v1.9 before v1.10 instead of after it, and it backs for-each-ref and branch sorting too.

The interesting thing about this function is that its logic is almost entirely a data table. The comparison runs a small state machine — S_N normal, S_I integer part, S_F fractional part, S_Z leading zeros — and the decision at each step is a lookup into a table git copied verbatim from glibc. Here it is in full, exactly as versioncmp.c declares it:

/* result_type: CMP: return diff; LEN: compare using len_diff/diff */
#define  CMP    2
#define  LEN    3

/* ... */

static const int8_t result_type[] = {
    /* state   x/x  x/d  x/0  d/x  d/d  d/0  0/x  0/d  0/0  */

    /* S_N */  CMP, CMP, CMP, CMP, LEN, CMP, CMP, CMP, CMP,
    /* S_I */  CMP, -1,  -1,  +1,  LEN, LEN, +1,  LEN, LEN,
    /* S_F */  CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP,
    /* S_Z */  CMP, +1,  +1,  -1,  CMP, CMP, -1,  CMP, CMP
};

The one cell I changed is the LEN in the S_N row, d/d (digit-against-digit) column — I set it to CMP. It's worth noticing how camouflaged that makes it: the S_I row directly below also carries LEN in its d/d column, so the planted CMP doesn't even look anomalous against its own neighbors. One token, in a 36-entry const array, with a plausible-looking value.

To see why that single cell matters, here's the code that consumes the table, a few lines down:

state = result_type[state * 3 + (((c2 == '0') + (isdigit (c2) != 0)))];

switch (state) {
case CMP:
    return diff;

case LEN:
    while (isdigit (*p1++))
        if (!isdigit (*p2++))
            return 1;

    return isdigit (*p2) ? -1 : diff;
/* ... */

CMP returns immediately on the current pair of characters. LEN does the numeric thing: it keeps walking the digit run, because a longer run of digits is a larger number. Take v1.9 versus v1.10. They first differ at 9 vs 1, so diff = '9' - '1' is positive. Under the correct LEN, the function notices that v1.9's digits have ended while v1.10's continue (*p2 is still '0'), concludes v1.10 is the longer — and therefore larger — number, and returns v1.9 < v1.10. Under the planted CMP, it just returns that positive diff and declares v1.9 the greater. The user-visible result: git tag --sort=version:refname sorts v1.10 before v1.9. Longer-digit components come out smaller.

Check it against the four conditions:

  • Hard to locate. There is no incorrect statement anywhere. Every line of logic is correct. The defect is one value in a const array. To know that cell should be LEN, you have to understand the state machine well enough to reconstruct what the table should contain — or recognize the table from elsewhere.
  • No crash, plausible output. git tag --sort=version:refname exits 0 and prints a clean, sorted-looking list. It's just in the wrong order.
  • Data-dependent. Only unequal-length digit runs misorder. v1.8 vs v1.9 is fine; v1.9 vs v1.10 is not. An agent that tests with same-length versions sees green and moves on.
  • No oracle. I built the test image from a single, historyless, remoteless commit (more on that below), so there is nothing to diff the table against.

One objection I want to meet head-on, because it's the first thing I'd say if I were reading this: the correct table is in the model's training data. glibc's strverscmp is public and old; a model may well "know" the canonical table. That's fine, and I'd argue it's realistic — human engineers know what correct code looks like too. The test isn't whether the model can recite glibc. It's whether the pipeline can take a symptom-only report ("version sort is wrong for v1.10 vs v1.9"), with no version-control oracle, localize the defect to one cell out of 36 in one file out of thousands, and produce a fix that a blind judge can verify by rebuilding and re-running. Recalling the canonical value helps with the last step and does nothing for the hard middle one.

The harness and the protocol

The point of being careful about the bug is so the comparison is about the models, not about the scaffolding. The protocol:

  • Subject: git built from source at commit 89c62ccd3e, with the one planted token, in a Docker image. The freshly built binary is on PATH via git's bin-wrappers, so an edit → make → re-run loop reflects changes immediately.
  • Oracle removed: the image ships a historyless, remoteless, single-commit repo. No .git log, no upstream, nothing to diff against except the agent's own edits. A staging step decontaminates the workspace and strips obvious leaks (the bug report doesn't name the file or the fix).
  • The invariant — Forge and its /fix-bug orchestration. Every run was driven by the same harness, Forge (forge-cli) — the one I took apart in Same Brain, Two Bodies — through the same /fix-bug orchestration: a fixed pipeline of triage → (optional plan-fix → review-plan) → implement → review-code → approve → commit, with the same personas, the same gates, and the same escalation rules. Triage decides whether to take the short path or the full plan-first path based on how hard it judges the bug. The harness, the pipeline, the prompts, the image — all frozen. The only thing that changed between runs was which model sat behind the wheel. That is what makes this a model comparison and not a harness comparison.
  • Scoring is out-of-band. A separate judge rebuilds the agent's committed tree, runs the reproducer, and checks the actual version ordering. It never reads the agent's own claim of success. The reproducer is fixed: tag v1.0 v1.8 v1.9 v1.10 v1.11 v1.100 v2.9 v2.10, run git tag -l --sort=version:refname, and check the order is ascending.

Forge has knobs for which model runs each phase

One detail makes "only the driver changed" a meaningful sentence: Forge doesn't run a single model. Each phase is bound to a persona, personas roll up into three cost tiers — heavy for the review gates, standard for the builders, light for bookkeeping — and you set one provider:model per tier (or override any single phase). Three knobs and the whole pipeline is routed:

{
  "persona-models": {
    "architect": { "provider": "anthropic", "model": "claude-opus-4-8" },
    "engineer":  { "provider": "anthropic", "model": "claude-sonnet-4-6" },
    "collator":  { "provider": "ollama",    "model": "qwen2.5:0.5b" }
  }
}

That provider field is the quietly enormous part, and it isn't Forge's doing — it's pi, the runtime Forge rides on, which resolves roughly a hundred models across every major provider behind one unified interface. Forge just hands pi a (provider, model) string and pi makes Opus, Sonnet, a local qwen, or a frontier OpenAI model interchangeable at the seat level. Thanks to the pi-ai maintainers for the unglamorous parity work — keeping streaming, tool-calls, caching, and token accounting identical across all of them — which is exactly what lets me swap one knob and trust that nothing else moved.

So you can point every phase at a different model and provider if you want to:

{
  "figure": "FIG. 2",
  "heading": "THE KNOBS — ONE EXAMPLE ROUTING",
  "title": "an illustrative config: heavy gates on Opus, build tier on Sonnet, commit on a free local model",
  "knobs": [
    { "tier": "heavy", "provider": "anthropic", "model": "claude-opus-4-8" },
    { "tier": "standard", "provider": "anthropic", "model": "claude-sonnet-4-6" },
    { "tier": "light", "provider": "ollama", "model": "qwen2.5:0.5b" }
  ],
  "rows": [
    { "phase": "triage", "persona": "bug-fixer", "emoji": "🐛", "tier": "standard", "provider": "anthropic", "model": "claude-sonnet-4-6" },
    { "phase": "plan-fix", "persona": "engineer", "emoji": "🌱", "tier": "standard", "provider": "anthropic", "model": "claude-sonnet-4-6" },
    { "phase": "review-plan", "persona": "supervisor", "emoji": "🌿", "tier": "heavy", "provider": "anthropic", "model": "claude-opus-4-8" },
    { "phase": "implement", "persona": "engineer", "emoji": "🌱", "tier": "standard", "provider": "anthropic", "model": "claude-sonnet-4-6" },
    { "phase": "review-code", "persona": "supervisor", "emoji": "🌿", "tier": "heavy", "provider": "anthropic", "model": "claude-opus-4-8" },
    { "phase": "approve", "persona": "architect", "emoji": "🗻", "tier": "heavy", "provider": "anthropic", "model": "claude-opus-4-8" },
    { "phase": "commit", "persona": "collator", "emoji": "🍃", "tier": "light", "provider": "ollama", "model": "qwen2.5:0.5b" }
  ],
  "total": "Σ set three tier knobs, or pin any single phase — that is the whole point of the routing. <strong>This experiment uses none of that freedom on purpose.</strong>"
}

For the experiment I deliberately didn't. To keep it a clean model comparison I pinned both the heavy and standard tiers to a single driver, left the light tier on Haiku for the trivial steps, and froze every phase — then ran it three times, changing exactly one thing each time: the driver model, set to Opus, then Fable, then Sonnet.

The start state: a symptom, and nothing to diff

This is everything the agent began with — a bug report written the way a real one is, by someone who can see the behavior but hasn't found the cause. It names the symptom precisely and hands over a reproducer. It never names the file, the function, or the fix:

GIT-BUG-003: --sort=version:refname orders multi-digit versions incorrectly

When listing refs with version-aware sorting, tags whose numeric components have more digits are ordered as if they were smallerv1.10 sorts before v1.9, v1.100 before v1.11. […] Equal-width components sort correctly; the problem only appears when two components have a different number of digits (9 vs 10). […] Leading-zero and prerelease-suffix sorting appear unaffected.

That last line is the kind of thing a careful reporter notices and a careless one drops — and it's a genuine clue, quietly narrowing the suspect code to the comparison of unequal-length integer runs. But "narrowed to a behavior" is not "pointed at a line." With no git history and no upstream remote baked into the image, the agent cannot git log -p its way to the cause or diff the file against a known-good copy. It has the symptom, the reproducer, and the whole source tree. Localization is the entire job — which is exactly what I wanted to measure.

Mechanically, each phase is a loop: the model calls a tool, the harness runs it, the result comes back, and that round-trip is one turn. Every turn re-reads the whole conversation so far (the append-only context I traced in What Models Want), which is why turns and tokens and dollars are all tangled together — and why I spend the rest of this piece pulling them apart.

{
  "figure": "FIG. 3",
  "title": "each model↔tool round-trip is one turn — and every turn re-reads a context that only grows",
  "turns": 7
}

I ran the bug on the same frozen image once per driver model — Claude Opus 4.8, Claude Fable 5, and Claude Sonnet 4.6, Haiku on the light tier for the trivial steps — plus an earlier Opus attempt on a broken build that I discarded and explain below.

Results

Every model passed. Each localized the defect to the S_N/d/d cell with no version-control oracle, restored LEN, and committed a change the out-of-band judge verified by rebuild-and-reproduce:

  • Opus: "versioncmp: fix version:refname sort of multi-digit numeric components"
  • Fable: "versioncmp: restore LEN entry for S_N digit/digit comparisons"
  • Sonnet: "fix(versioncmp): correct multi-digit version:refname sort order"

On this bug, raw capability is a tie — and that's a real result, not a null one. The task was hard enough that a tie means every model can do something genuinely difficult: reason about a state machine from its behavior and patch a data table blind. But "they all passed" hides the interesting structure. Two of the fixes were canonical — Opus and Fable restored exactly the one wrong cell. Sonnet's was equivalent-correct but not surgical: it fixed the right cell and changed an adjacent one too. (Whether that extra edit is a regression is a story in itself — Section "The cell that didn't matter" below.) So the differentiator on this bug isn't pass/fail. It's surgical precision, and it sorts the models into a ladder.

Here is the whole board, every number verified against forge-cli's own transcript archive and re-derived from token buckets to the cent:

{
  "figure": "FIG. 4",
  "heading": "THE FOUR-RUN BOARD",
  "title": "every run, every number — verified against the transcript archive, re-derived to the cent",
  "corner": "metric",
  "columns": [
    { "name": "R1 Opus", "sub": "blind · sidebar", "muted": true },
    { "name": "R1b Opus", "sub": "clean" },
    { "name": "R2 Fable", "sub": "clean" },
    { "name": "R3 Sonnet", "sub": "outlier", "highlight": true }
  ],
  "rows": [
    { "metric": "Outcome", "cells": ["PASS", "PASS", "PASS", "PASS"] },
    { "metric": "Fix class", "cells": ["canonical", "canonical", "canonical", "equiv-correct"] },
    { "metric": "Path (phases)", "cells": ["A · 5", "B · 7", "A · 5", "A · 5"] },
    { "metric": "Turns", "cells": ["74", "90", "60", "89"] },
    { "metric": "Wall (s)", "cells": ["791", "1,025", "693", "1,227"] },
    { "metric": "Output tok", "cells": ["44,424", "53,025", "26,407", "59,429"] },
    { "metric": "Cache-read tok", "cells": ["1.55M", "1.92M", "1.00M", "2.41M"] },
    { "metric": "Cost — verified", "cells": ["$2.62", "$3.09", "$3.33", "$2.13"], "emphasis": true }
  ],
  "total": "Σ R1 (greyed) is the blind broken-build Opus run, kept as a sidebar; the clean three-way is <strong>R1b · R2 · R3</strong>. Read the bottom row against the rest: the run that did the most work, <strong>R3 Sonnet</strong>, paid the least."
}

The first thing to read off that board is the cost column, because it inverts everything else. The run that did the most work — Sonnet, at 89 turns and 59K output tokens — landed at the bottom of the bill:

{
  "figure": "FIG. 5",
  "heading": "THE COST LADDER",
  "title": "verified bill per run — the cheapest seat did the most work",
  "items": [
    { "label": "Sonnet 4.6", "value": 2.1331, "display": "$2.13" },
    { "label": "Opus 4.8", "value": 3.0916, "display": "$3.09" },
    { "label": "Fable 5", "value": 3.3273, "display": "$3.33" }
  ],
  "total": "Σ same bug, same frozen pipeline — only the driver changed. Cost ranking <strong>Sonnet $2.13 &lt; Opus $3.09 &lt; Fable $3.33</strong> tracks the per-token rate card, <strong>not</strong> the effort: Sonnet did the most work and paid the least."
}

Same capability, different bill: Fable vs Opus

Take the two canonical runs first, because before the run the pricing made the bet look one-sided. Fable lists at exactly double Opus across every billing bucket:

{
  "figure": "FIG. 6",
  "heading": "ON PAPER",
  "title": "list price per million tokens — Fable 5 bills 2× Opus 4.8, every bucket",
  "seriesA": "Claude Fable 5",
  "seriesB": "Claude Opus 4.8",
  "items": [
    { "label": "base input", "a": 10, "b": 5, "aDisplay": "$10", "bDisplay": "$5" },
    { "label": "5m cache write", "a": 12.5, "b": 6.25, "aDisplay": "$12.50", "bDisplay": "$6.25" },
    { "label": "cache hit", "a": 1, "b": 0.5, "aDisplay": "$1", "bDisplay": "$0.50" },
    { "label": "output", "a": 50, "b": 25, "aDisplay": "$50", "bDisplay": "$25" }
  ],
  "total": "Σ Fable bills <strong>exactly 2× Opus</strong> per token — every bucket, no exceptions"
}

So with capability tied and Fable billing double, Fable should simply lose on cost. Here is what actually happened — same frozen image, same bug, same low tier (Haiku) for the trivial steps; only the driver and review model changed:

{
  "figure": "FIG. 7",
  "heading": "HEAD TO HEAD",
  "title": "same frozen image, same bug — every metric, both models (lower leads)",
  "colA": { "name": "Fable 5", "sub": "short path · 5 phases" },
  "colB": { "name": "Opus 4.8", "sub": "full path · 7 phases" },
  "rows": [
    { "metric": "phases", "a": "5", "b": "7", "lead": "a" },
    { "metric": "turns", "a": "60", "b": "90", "lead": "a" },
    { "metric": "output tokens", "a": "26,407", "b": "53,025", "lead": "a" },
    { "metric": "cache-read tokens", "a": "1,001,410", "b": "1,915,053", "lead": "a" },
    { "metric": "cost — the invoice", "a": "$3.33", "b": "$3.09", "lead": "b", "emphasis": true }
  ],
  "total": "Fable leads <strong>every row of work</strong> — fewer phases, turns, and tokens — then loses the only row that's a bill. <strong>Doing less didn't cost less.</strong>"
}

Here are both runs replayed straight from forge-cli's own transcript archive — the same per-phase telemetry the cost figures are computed from. Opus, on the seven-phase path it chose at triage:

{
  "src": "/blog/stories/capability-test-coding-agent/opus.cast",
  "poster": "npt:0:04",
  "theme": "dracula",
  "rows": 30,
  "cols": 100
}

Opus 4.8 fixing GIT-BUG-003: triage → plan-fix → review-plan → implement → review-code → approve → commit. Bottom-right totals, Σ ↓53k ⇪1.92M, match the scoreline above.

And Fable, whose triage skipped the planning phases and went straight to the fix in five:

{
  "src": "/blog/stories/capability-test-coding-agent/fable.cast",
  "poster": "npt:0:04",
  "theme": "dracula",
  "rows": 30,
  "cols": 100
}

Fable 5 on the same bug, same image: triage → implement → review-code → approve → commit. Fewer phases, fewer turns — and the larger bill.

Read that scoreline again, because the two efficiency stories disagree. Opus's triage elected the full path — plan, then plan-review, then implement, seven phases in all. Fable's triage went straight to the fix in five. So Opus took 50% more turns (90 vs 60) and two extra phases of deliberation — and it still cost less: $3.09 against Fable's $3.33.

Line the phases up and the gap has an obvious shape: it's the two phases Fable's triage skipped. In every phase they share, the turn counts are within a few of each other.

{
  "figure": "FIG. 8",
  "heading": "TURNS PER PHASE",
  "title": "round-trips per phase — same image, same bug, two routes through the pipeline",
  "seriesA": "Fable 5 — 5-phase path",
  "seriesB": "Opus 4.8 — 7-phase path",
  "items": [
    { "label": "triage", "a": 13, "b": 16 },
    { "label": "plan-fix", "b": 10 },
    { "label": "review-plan", "b": 13 },
    { "label": "implement", "a": 17, "b": 20 },
    { "label": "review-code", "a": 16, "b": 17 },
    { "label": "approve", "a": 10, "b": 10 },
    { "label": "commit", "a": 4, "b": 4 }
  ],
  "total": "Σ Fable <strong>60 turns</strong> / 5 phases · Opus <strong>90</strong> / 7 — the dashed rows are the two phases Fable never ran"
}

That inversion is the whole point, and it's structural, not luck. Fable's per-token rates are roughly double Opus's in every billing bucket — base input, cache, output. So even when Opus does more work, more turns, more thinking, the invoice comes in lower, because each token is cheaper. Turns measure how much the pipeline did; dollars measure how much it did times the rate. Here they point in opposite directions, and the rate wins. Rank these two by turns and Fable looks leaner. Rank them by the number on the bill and Opus wins. The fewest-turns model is the most expensive one.

Put the dollars next to the turns and the 2× rate shows through directly: in the phases they share, each Fable phase costs roughly double its Opus counterpart — exactly the price card from FIG. 6, applied to nearly the same work.

{
  "figure": "FIG. 9",
  "heading": "DOLLARS PER PHASE",
  "title": "cost per phase — Opus's two extra phases still don't close Fable's per-phase 2×",
  "seriesA": "Fable 5",
  "seriesB": "Opus 4.8",
  "items": [
    { "label": "triage", "a": 1.0222, "b": 0.9658, "aDisplay": "$1.02", "bDisplay": "$0.97" },
    { "label": "plan-fix", "b": 0.2946, "bDisplay": "$0.29" },
    { "label": "review-plan", "b": 0.4851, "bDisplay": "$0.49" },
    { "label": "implement", "a": 0.7506, "b": 0.4899, "aDisplay": "$0.75", "bDisplay": "$0.49" },
    { "label": "review-code", "a": 0.9059, "b": 0.4744, "aDisplay": "$0.91", "bDisplay": "$0.47" },
    { "label": "approve", "a": 0.4979, "b": 0.2738, "aDisplay": "$0.50", "bDisplay": "$0.27" },
    { "label": "commit", "a": 0.1507, "b": 0.1080, "aDisplay": "$0.15", "bDisplay": "$0.11" }
  ],
  "total": "Σ Fable <strong>$3.33</strong> / 5 phases · Opus <strong>$3.09</strong> / 7 — fewer phases, bigger bill"
}

And if you're wondering where the money actually goes inside a single run, it is not where the tokens are. Opus moved 1.92M cache-read tokens — 91% of every token it was billed for — but those cheap re-reads are only a third of the bill. The 53K output tokens, two and a half percent of the volume, are the single biggest line item:

{
  "figure": "FIG. 10",
  "heading": "WHERE THE $3.09 GOES",
  "title": "Opus run — cost by billing bucket (volume and bill rank in opposite orders)",
  "items": [
    { "label": "output", "value": 1.3256, "display": "$1.33" },
    { "label": "cache reads", "value": 0.9575, "display": "$0.96" },
    { "label": "cache writes", "value": 0.8076, "display": "$0.81" },
    { "label": "base input", "value": 0.0009, "display": "$0.001" }
  ],
  "total": "Σ <strong>$3.09</strong> — 53K output tokens (2.5% of volume) are 43% of the bill; 1.92M cache reads (91% of volume) are 31%. <strong>Rate, not volume, decides the damage.</strong>"
}

The cell that didn't matter

Now the bottom of the ladder, because Sonnet is where the capability story lives — not the cost one. Sonnet's triage was the most thorough of the four runs: 35 turns, running the reproducer and a couple of git's own version-sort tests (t7004, t6300) to pin the symptom to the table. It found the bug. Then it fixed two cells where one was wrong.

The canonical fix restores LEN at exactly one index — the S_N row's d/d cell (index 4), the planted token. Opus and Fable each changed that one cell and stopped. Sonnet changed index 4 and the adjacent index 5 (d/0):

-		/* S_N */  CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP,
+		/* S_N */  CMP, CMP, CMP, CMP, LEN, LEN, CMP, CMP, CMP,
                                       ^^^  ^^^
                                  idx4 (the bug)  idx5 (extra; upstream has CMP)

Upstream git has LEN at index 4 and CMP at index 5. So Sonnet got the defect dead right and added one unnecessary change next to it. My first instinct — and I'll bet yours too — was: that's a regression. A model that "fixes" a bug by also editing a cell that wasn't broken has changed behavior it didn't need to, and changed behavior is where regressions live.

So I went to confirm it. And this is the part of the post I most want you to take away, because I nearly got it wrong in the most seductive way possible.

My first move was to write a small standalone C re-implementation of versioncmp and fuzz the two table variants against each other. It was fast, it was clean, and it told me Sonnet's table diverged from upstream on ~2.4% of inputs, with crisp counterexamples like v9 vs v06. I had my regression. I almost wrote it up.

Then I did the thing the out-of-band judge exists to force: I stopped trusting my model of the program and ran the actual artifact. I sorted the same tag sets — 4,000 random version strings, heavy on the leading-zero forms my re-implementation flagged, plus every named counterexample — with Sonnet's real compiled git binary and the canonical binary, and diffed the output.

@caption: the verdict that overruled my own fuzzer — real git binaries, 4000 tags
$ md5sum sonnet-sort.txt canonical-sort.txt
  9f3c…a1  sonnet-sort.txt
  9f3c…a1  canonical-sort.txt        # byte-identical
$ # explicit counterexamples from the re-implementation:
$ for v in v9 v06 v09.4 v2 v1; do …; done   # identical ordering in both binaries
  reproducer ✅   t7004-tag.sh (231) ✅   t6300-for-each-ref.sh (429) ✅

Byte-identical. Not one of my fuzzer's counterexamples reproduced in real git. The re-implementation was wrong — it didn't faithfully model git's state-machine handling, so it manufactured divergences the actual comparator never produces. The d/0 cell Sonnet touched is, for the version strings versioncmp actually sees, a don't-care: changing it to LEN is behaviorally identical to upstream's CMP. Sonnet's fix is correct — verified on the real binary, against 4,000 tags and git's own test suite — just not minimal. Verdict: PASS, equivalent-correct. The only non-canonical fix of the four, and not a regression.

I almost reported a bug that wasn't there, because I trusted a re-derivation of the algorithm instead of the thing under test. That is exactly the failure the experiment's out-of-band judge is built to prevent — rebuild the real tree, run the real reproducer, never trust a claim or a model of success — and I'd walked right into it with my own analysis. Verify against the artifact under test, not a model of it. Re-implementations lie; binaries don't.

Two findings fall out of the Sonnet run, then. First, the capability one: all four models localized and fixed the bug, so the differentiator isn't pass/fail — it's surgical precision. The top tier changed exactly the one wrong cell; the cheaper tier got there correctly but with a slightly bigger knife. Localization is cheap once the task is solvable at all; minimality is what the higher tier buys. Second, the cost one, which extends the Fable/Opus lesson all the way down the ladder: Sonnet did the most work of any run — 89 turns, 59K output, 2.4M cache-read — and paid the least, because its output rate ($15/MTok) is a third of Fable's and three-fifths of Opus's. Turns measure effort; the bill is effort times a rate; and across all four runs, the rate is what wins.

The run I threw out (and what it taught me about turns)

There was a third run, and it's why I don't fully trust the turn counts above until I look underneath them. My first Opus attempt landed on a broken image: make git died because an unrelated optional Rust subsystem needed cargo, which wasn't installed (make: *** [...libgitcore.a] Error 127). Opus found and applied the correct fix early, then spent much of its implement phase fighting a build it couldn't run — diagnosing the missing toolchain, and finally writing a standalone test harness to validate its fix by pure reasoning, because it could not compile git to test it. It produced the canonical fix to a state-table bug it never got to execute. Striking to watch — and it inflated that run's implement phase to 29 turns of mostly failed builds.

@caption: the discarded run — make git on the broken image
$ make git
    CC compat/fsmonitor/fsm-path-utils-linux.o
    AR libgit.a
 CARGO target/release/libgitcore.a
/bin/sh: 1: cargo: not found
make: *** [Makefile:3020: target/release/libgitcore.a] Error 127

That is exactly why total turns is a number to handle with tongs. When I re-ran Opus on a working image, the implement phase dropped from 29 turns to 20 — right next to Fable's 17 — confirming the inflation was the environment, not the model. The per-phase work was comparable all along. What moved the total wasn't chattiness; it was triage choosing a longer path. My two Opus runs didn't even agree with each other on which path to take. A metric that swings on a build error and a routing coin-flip is not a measure of model capability. Cost, driven by a fixed rate card, is the stable one — and it ranks the same way every time: Sonnet cheapest, Opus in the middle, Fable the expensive seat, no matter how the turn counts wobble.

Caveats

  • n = 1 per model, one bug. This is a directional pilot, not a statistical claim. A real capability verdict needs multiple bugs of varying kinds and repeated trials.
  • Triage path varies, and it dominates total turns. On the identical image, Opus took the seven-phase path while Fable and Sonnet took the five-phase one, and even my two Opus runs disagreed with each other. Compare per-phase work and compare cost; don't rank models on total turn count from n=1.
  • "Capability" here means the pipeline produced a judge-verified, behaviorally-correct fix. Three runs were canonical; Sonnet's was equivalent-correct (correct plus one inert extra edit, verified on the real binary). I did not score fix quality beyond correctness, minimality, and regression-safety on this one bug.
  • The routing matrices show the experiment's control wiring, not a tuned production config — heavy and standard tiers pinned to one driver on purpose. The point was to isolate the model, not to find the cheapest viable mix (which, on this evidence, would route the build tier down and spend only on the gates).
  • Everything is specific to this harness, this bug, the pi runtime, and the model pricing on the run date.

What I'd tell anyone benchmarking a coding agent

Two things, and they rhyme.

First, the model comparison was the cheap part. The expensive part — the part that determines whether your numbers mean anything — is building a task that can actually separate good from good-enough, and then proving the task is live before you spend a cent on inference. Plant the bug, then try to make it not matter: throw hundreds of randomized inputs at it, and if the buggy and fixed builds ever agree, you don't have a test, you have a story. My first bug taught me that. The second one earned its place by surviving the attempt to dismiss it.

Second — and this is the one I almost flunked — when you check a fix, check the artifact, not your model of it. My own re-implementation of versioncmp confidently reported a regression in Sonnet's fix that did not exist in a single real invocation of git. The discipline that saved it is the same discipline that makes the whole benchmark trustworthy: rebuild the real tree, run the real reproducer, and never let a re-derivation stand in for the thing under test. A judge that grades a story about the code instead of the code is just a more expensive way to be wrong.

And a closing note of thanks. The reason "swap one model, hold everything else perfectly still" is even possible is the unglamorous parity layer underneath Forge: pi and the pi-ai runtime resolve roughly a hundred models across every major provider behind one interface, keeping streaming, tool-calls, caching, and token accounting identical so that a (provider, model) string is the only thing that changes between runs. That grind is invisible exactly when it works, which is most of the time. Thank you, pi-ai — for the knobs, and for making them trustworthy enough to build an experiment on.

About Boni Gopalan

Elite software architect specializing in AI systems, emotional intelligence, and scalable cloud architectures. Founder of Entelligentsia.

Entelligentsia Entelligentsia