How I Built a Capability Test I'd Actually Trust for a Coding Agent

Abstract

I run an agentic coding harness, Forge, and I wanted to know whether Claude Fable 5 — newer, and twice the price — is worth routing into it, or just more expensive. Cost is easy to read off a meter; capability is the hard half, and most agent benchmarks can't measure it because the task is too easy to separate two strong models. So I built one that isn't: I planted a single-token bug deep in git's version-sorting code, stripped away every way to look up the answer (no git history, no upstream to diff against), and had Forge's /fix-bug orchestration find and fix it from a symptom-only report — once driven by Fable 5, once by Opus 4.8, with the harness and every gate held identical.

Both models found and correctly fixed it: on capability, a tie. The economics were the surprise. Fable's list price is exactly double Opus's, per token. But Fable's triage took a shorter path — five pipeline phases to Opus's seven — and so used about half the tokens and a third fewer turns. Double the rate against half the work nearly cancel: the run that "should" have cost 2× came in at 1.08× ($3.33 vs $3.09). It's one task, so treat it as an early signal — but it points somewhere specific: on longer agentic tasks, where a leaner path compounds, a pricier-per-token model can even out, or come out ahead.

{
  "figure": "FIG. 1",
  "heading": "THE RESULT, IN THREE NUMBERS",
  "title": "Fable 5 ÷ Opus 4.8 on one real agentic bug-fix — rate, work, and the bill that results",
  "items": [
    { "label": "list rate", "value": 2.00, "display": "2.00×" },
    { "label": "tokens used", "value": 0.50, "display": "0.50×" },
    { "label": "actual bill", "value": 1.08, "display": "1.08×" }
  ],
  "total": "Σ double the rate (2.00×) × half the work (0.50×) ≈ the same bill (1.08×). Fable's shorter path nearly cancels its 2× price — and on a longer task the work saving could tip it the other way."
}

A new model dropped — Claude Fable 5 — and I had the same question I have every time one does: is it actually better for my work, or just newer and more expensive? I run an agentic SDLC harness called forge-cli: it drives a bug from a symptom report through triage, implementation, review, and commit, and it lets me assign a different model to each phase. Routing a new model into that pipeline is a real decision with a real invoice attached, so I wanted evidence, not vibes.

Cost is the easy half. You run the same task twice and read the token meters. The hard half is capability: did the more expensive model actually do anything better? Most "capability" benchmarks for coding agents quietly fail to answer this, because the task is too easy to separate the contestants. If both models trivially succeed, you haven't measured capability — you've measured overhead, and the cheaper model wins by definition.

So before I could compare two models, I had to build a task that could actually tell them apart. That turned out to be most of the work, and it's the part worth writing down. The model comparison is almost an afterthought by the end.

{
  "items": [
    {
      "q": "Is the new model actually better — or just newer and more expensive?",
      "body": "Cost you can read off a meter. Capability you have to <strong>earn the right to measure</strong> — with a task hard enough to tell two good models apart.",
      "tag": "→ the whole reason for this build"
    },
    {
      "q": "If both models pass, what did you actually measure?",
      "body": "An easy task measures <em>overhead</em>, not skill — and the cheaper model wins by default. The bug itself has to be able to <strong>separate them</strong>.",
      "tag": "→ four conditions, next"
    }
  ]
}

What makes a bug a fair, hard test

I wanted a single bug that satisfied four conditions:

  1. Hard to locate, not just hard to fix. The difficulty should be in finding the defect, because localization is where agentic pipelines actually live or die. A one-line fix you can't find is harder than a fifty-line fix sitting under a stack trace.
  2. No crash, plausible output. If the program segfaults or prints an obvious error, the symptom hands you the location. I wanted a bug that produces well-formed, confident, wrong output and exits 0.
  3. Data-dependent. The failure should only show up on some inputs, so that a careless "looks fixed to me" check can pass while the bug is still there. This punishes shallow verification.
  4. No oracle. This is the one most benchmarks skip. If the repo has its full git history, or an upstream remote, "find the bug" collapses into git log -p or git diff upstream -- the_file. The agent doesn't reason about the code; it diffs against a known-good copy. To test reasoning, the known-good copy has to be unavailable.

Real codebases are full of bugs that meet some of these. The trick is meeting all four on purpose, in a way you can verify is genuinely triggered before you spend money running models against it.

The bug I threw out first

My first candidate looked perfect and was useless, and I think the reason is the most useful thing in this whole writeup.

I planted a sign flip in linear-assignment.c — the Jonker–Volgenant solver that backs git range-diff. The defect lived in the solver's dual-variable reduction, the running adjustments that keep its cost estimates consistent as it assigns columns to rows:

int min = COST(!j1, i) - v[!j1];
for (j = 1; j < column_count; j++)
    if (j != j1 && min > COST(j, i) - v[j])
        min = COST(j, i) - v[j];
v[j1] -= min;

My flip negated an update like that last line. It was beautiful: a single character, buried in numerical-optimization code that almost nobody reads, with no crash and no obvious tell. I was pleased with myself.

Then I tested whether it actually did anything. I generated 600+ randomized range-diff scenarios and ran the buggy binary against the corrected one. The output was identical every time. Zero divergences.

The bug was real but behaviorally inert. Two things conspired: the solver's augmentation phase re-derives and self-corrects the dual variable I had corrupted, and git range-diff's cost matrix is diagonal-dominant enough that the assignment almost never changes even when the intermediate math does. I had planted a defect the program routes around.

This is a trap worth naming. A bug you cannot reliably reproduce is a bug the triage gate cannot confirm, the agent cannot test against, and the judge cannot verify. Subtlety is worthless if the symptom never fires. Validate that your planted bug changes observable behavior — deterministically — before you run anything against it. I discarded the sign flip and went looking for a defect I could prove was live.
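A minimal sketch of that liveness check, assuming each build can be wrapped as a Python callable that maps an input to its observable output. The names `find_divergences`, `random_cases`, and the three toy sort functions are hypothetical stand-ins, not part of Forge or git. The inert variant changes the intermediate math but not the result, like the discarded sign flip; the live variant misorders only some inputs.

```python
import random

def find_divergences(buggy, fixed, cases):
    """Return every case on which the two builds produce different output."""
    return [case for case in cases if buggy(case) != fixed(case)]

def random_cases(count, seed=0):
    """Deterministic pseudo-random inputs, so any divergence can be replayed."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 120) for _ in range(6)) for _ in range(count)]

# Hypothetical stand-ins for "run the binary and capture its output".
def fixed_build(nums):
    return sorted(nums)

def inert_bug(nums):
    # Different intermediate values, identical ordering: behaviorally inert.
    return sorted(nums, key=lambda n: n * 2)

def live_bug(nums):
    # Compares as text, so 10 lands before 9: wrong only on mixed digit widths.
    return sorted(nums, key=str)

if __name__ == "__main__":
    cases = random_cases(600)
    print("inert bug divergences:", len(find_divergences(inert_bug, fixed_build, cases)))
    print("live bug divergences: ", len(find_divergences(live_bug, fixed_build, cases)))
```

A planted bug is worth keeping only if the divergence list is non-empty, and the cases it returns double as ready-made reproducers.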

The bug I kept: one token in versioncmp.c

versioncmp.c is git's implementation of version-aware string comparison — git's port of glibc's strverscmp. It's what makes git tag --sort=version:refname put v1.9 before v1.10 instead of after it, and it backs for-each-ref and branch sorting too.

The interesting thing about this function is that its logic is almost entirely a data table. The comparison runs a small state machine — S_N normal, S_I integer part, S_F fractional part, S_Z leading zeros — and the decision at each step is a lookup into a table git copied verbatim from glibc. Here it is in full, exactly as versioncmp.c declares it:

/* result_type: CMP: return diff; LEN: compare using len_diff/diff */
#define  CMP    2
#define  LEN    3

/* ... */

static const int8_t result_type[] = {
    /* state   x/x  x/d  x/0  d/x  d/d  d/0  0/x  0/d  0/0  */

    /* S_N */  CMP, CMP, CMP, CMP, LEN, CMP, CMP, CMP, CMP,
    /* S_I */  CMP, -1,  -1,  +1,  LEN, LEN, +1,  LEN, LEN,
    /* S_F */  CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP,
    /* S_Z */  CMP, +1,  +1,  -1,  CMP, CMP, -1,  CMP, CMP
};

The one cell I changed is the LEN in the S_N row, d/d (digit-against-digit) column: I set it to CMP. It's worth noticing how camouflaged that makes it: every other entry in the S_N row is already CMP, and the S_F and S_Z rows carry CMP in their d/d columns too, so the planted CMP doesn't look anomalous against its neighbors. Only the S_I row keeps LEN in that column. One token, in a 36-entry const array, with a plausible-looking value.

To see why that single cell matters, here's the code that consumes the table, a few lines down:

state = result_type[state * 3 + (((c2 == '0') + (isdigit (c2) != 0)))];

switch (state) {
case CMP:
    return diff;

case LEN:
    while (isdigit (*p1++))
        if (!isdigit (*p2++))
            return 1;

    return isdigit (*p2) ? -1 : diff;
/* ... */

CMP returns immediately on the current pair of characters. LEN does the numeric thing: it keeps walking the digit run, because a longer run of digits is a larger number. Take v1.9 versus v1.10. They first differ at 9 vs 1, so diff = '9' - '1' is positive. Under the correct LEN, the function notices that v1.9's digits have ended while v1.10's continue (*p2 is still '0'), concludes v1.10 is the longer — and therefore larger — number, and returns v1.9 < v1.10. Under the planted CMP, it just returns that positive diff and declares v1.9 the greater. The user-visible result: git tag --sort=version:refname sorts v1.10 before v1.9. Longer-digit components come out smaller.
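The walk-through above can be replayed with a small Python transcription of the state machine. This is a sketch for illustration, not git's code: the `NEXT_STATE` transition table is not shown in this article and is reproduced here as an assumption from the glibc-derived source, git's prerelease-suffix handling is omitted, and `result_table`, `versioncmp`, `CORRECT`, and `PLANTED` are names chosen for this example.

```python
from functools import cmp_to_key

S_N, S_I, S_F, S_Z = 0, 3, 6, 9   # state bases: normal, integer, fraction, zeros
CMP, LEN = 2, 3

# Assumption: transition table from the glibc-derived source (not in the article).
NEXT_STATE = [
    S_N, S_I, S_Z,   # from S_N on: other, digit, zero
    S_N, S_I, S_I,   # from S_I
    S_N, S_F, S_F,   # from S_F
    S_N, S_F, S_Z,   # from S_Z
]

def result_table(sn_dd):
    """The result_type table quoted above, with the S_N, d/d cell parameterized."""
    return [
        CMP, CMP, CMP, CMP, sn_dd, CMP, CMP, CMP, CMP,
        CMP, -1, -1, +1, LEN, LEN, +1, LEN, LEN,
        CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP, CMP,
        CMP, +1, +1, -1, CMP, CMP, -1, CMP, CMP,
    ]

CORRECT = result_table(LEN)
PLANTED = result_table(CMP)

def is_digit(c):
    return "0" <= c <= "9"

def char_class(c):
    # 0 = other, 1 = nonzero digit, 2 = '0' (mirrors the C expression)
    return (c == "0") + is_digit(c)

def versioncmp(s1, s2, table):
    a, b = s1 + "\0", s2 + "\0"          # emulate C's NUL terminator
    i = j = 1
    c1, c2 = a[0], b[0]
    state = S_N + char_class(c1)
    while True:
        diff = ord(c1) - ord(c2)
        if diff != 0:
            break
        if c1 == "\0":
            return 0
        state = NEXT_STATE[state]
        c1, c2 = a[i], b[j]
        i, j = i + 1, j + 1
        state += char_class(c1)
    action = table[state * 3 + char_class(c2)]
    if action == CMP:
        return diff
    if action == LEN:
        while is_digit(a[i]):            # keep walking the first digit run
            i += 1
            if not is_digit(b[j]):
                return 1                 # first run is longer, so it is larger
            j += 1
        return -1 if is_digit(b[j]) else diff
    return action

if __name__ == "__main__":
    tags = ["v1.0", "v1.8", "v1.9", "v1.10", "v1.11", "v1.100", "v2.9", "v2.10"]
    for name, table in (("correct", CORRECT), ("planted", PLANTED)):
        key = cmp_to_key(lambda x, y, t=table: versioncmp(x, y, t))
        print(name, sorted(tags, key=key))
```

Comparing `v1.8` with `v1.9` gives the same answer under both tables, while `v1.9` against `v1.10` flips sign, which is the data dependence the test relies on.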

Check it against the four conditions:

  • Hard to locate. There is no incorrect statement anywhere. Every line of logic is correct. The defect is one value in a const array. To know that cell should be LEN, you have to understand the state machine well enough to reconstruct what the table should contain — or recognize the table from elsewhere.
  • No crash, plausible output. git tag --sort=version:refname exits 0 and prints a clean, sorted-looking list. It's just in the wrong order.
  • Data-dependent. Only unequal-length digit runs misorder. v1.8 vs v1.9 is fine; v1.9 vs v1.10 is not. An agent that tests with same-length versions sees green and moves on.
  • No oracle. I built the test image from a single, historyless, remoteless commit (more on that below), so there is nothing to diff the table against.

One objection I want to meet head-on, because it's the first thing I'd say if I were reading this: the correct table is in the model's training data. glibc's strverscmp is public and old; a model may well "know" the canonical table. That's fine, and I'd argue it's realistic — human engineers know what correct code looks like too. The test isn't whether the model can recite glibc. It's whether the pipeline can take a symptom-only report ("version sort is wrong for v1.10 vs v1.9"), with no version-control oracle, localize the defect to one cell out of 36 in one file out of thousands, and produce a fix that a blind judge can verify by rebuilding and re-running. Recalling the canonical value helps with the last step and does nothing for the hard middle one.

The harness and the protocol

The point of being careful about the bug is so the comparison is about the models, not about the scaffolding. The protocol:

  • Subject: git built from source at commit 89c62ccd3e, with the one planted token, in a Docker image. The freshly built binary is on PATH via git's bin-wrappers, so an edit → make → re-run loop reflects changes immediately.
  • Oracle removed: the image ships a historyless, remoteless, single-commit repo. No .git log, no upstream, nothing to diff against except the agent's own edits. A staging step decontaminates the workspace and strips obvious leaks (the bug report doesn't name the file or the fix).
  • The invariant — Forge and its /fix-bug orchestration. Every run was driven by the same harness, Forge (forge-cli) — the one I took apart in Same Brain, Two Bodies — through the same /fix-bug orchestration: a fixed pipeline of triage → (optional plan-fix → review-plan) → implement → review-code → approve → commit, with the same personas, the same gates, and the same escalation rules. Triage decides whether to take the short path or the full plan-first path based on how hard it judges the bug. The harness, the pipeline, the prompts, the image — all frozen. The only thing that changed between the two runs was which model sat behind the wheel. That is what makes this a model comparison and not a harness comparison.
  • Scoring is out-of-band. A separate judge rebuilds the agent's committed tree, runs the reproducer, and checks the actual version ordering. It never reads the agent's own claim of success. The reproducer is fixed: tag v1.0 v1.8 v1.9 v1.10 v1.11 v1.100 v2.9 v2.10, run git tag -l --sort=version:refname, and check the order is ascending.
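A rough sketch of what such an out-of-band check could look like, assuming the judge already has a rebuilt git binary and a scratch repository carrying the eight tags. The function names and the `git_binary` and `repo_dir` parameters are hypothetical; the actual judge is not shown in this article. The ordering is checked with an independent numeric key rather than with git's own comparator, so the judge does not inherit the bug it is testing for.

```python
import subprocess

EXPECTED_TAGS = ["v1.0", "v1.8", "v1.9", "v1.10", "v1.11", "v1.100", "v2.9", "v2.10"]

def numeric_key(tag):
    """'v1.10' -> (1, 10): compare components as integers, not as text."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))

def is_ascending(tags):
    keys = [numeric_key(tag) for tag in tags]
    return keys == sorted(keys)

def verdict(listed):
    """Pass only if every expected tag is listed and the order is ascending."""
    return sorted(listed) == sorted(EXPECTED_TAGS) and is_ascending(listed)

def judge(git_binary, repo_dir):
    """Run the fixed reproducer against the rebuilt binary (paths are assumptions)."""
    result = subprocess.run(
        [git_binary, "tag", "-l", "--sort=version:refname"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return verdict(result.stdout.split())
```

Nothing here reads the agent's transcript or its claim of success; the only input is the behavior of the committed tree.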

The start state: a symptom, and nothing to diff

This is everything the agent began with — a bug report written the way a real one is, by someone who can see the behavior but hasn't found the cause. It names the symptom precisely and hands over a reproducer. It never names the file, the function, or the fix:

GIT-BUG-003: --sort=version:refname orders multi-digit versions incorrectly

When listing refs with version-aware sorting, tags whose numeric components have more digits are ordered as if they were smaller: v1.10 sorts before v1.9, v1.100 before v1.11. […] Equal-width components sort correctly; the problem only appears when two components have a different number of digits (9 vs 10). […] Leading-zero and prerelease-suffix sorting appear unaffected.

That last line is the kind of thing a careful reporter notices and a careless one drops — and it's a genuine clue, quietly narrowing the suspect code to the comparison of unequal-length integer runs. But "narrowed to a behavior" is not "pointed at a line." With no git history and no upstream remote baked into the image, the agent cannot git log -p its way to the cause or diff the file against a known-good copy. It has the symptom, the reproducer, and the whole source tree. Localization is the entire job — which is exactly what I wanted to measure.
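One way to confirm the "nothing to diff" claim before a run is to inspect the staged repository. The sketch below is an assumption about how that could be checked, not the staging step used here: it counts commits reachable from any ref and lists configured remotes. The helper names are hypothetical, and it does not cover other possible leaks such as stashes or vendored copies of the file.

```python
import subprocess

def looks_oracle_free(commit_count, remotes):
    """A single commit and no remotes: no history to log, no upstream to diff."""
    return commit_count == 1 and not remotes

def inspect_repo(repo_dir):
    """Collect commit count and remote names from the repository at repo_dir."""
    def git(*args):
        done = subprocess.run(
            ["git", *args], cwd=repo_dir,
            capture_output=True, text=True, check=True,
        )
        return done.stdout
    commit_count = int(git("rev-list", "--count", "--all").strip())
    remotes = git("remote").split()
    return commit_count, remotes

def check_repo(repo_dir):
    return looks_oracle_free(*inspect_repo(repo_dir))
```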

Mechanically, each phase is a loop: the model calls a tool, the harness runs it, the result comes back, and that round-trip is one turn. Every turn re-reads the whole conversation so far (the append-only context I traced in What Models Want), which is why turns and tokens and dollars are all tangled together — and why I spend the rest of this piece pulling them apart.

{
  "figure": "FIG. 2",
  "title": "each model↔tool round-trip is one turn — and every turn re-reads a context that only grows",
  "turns": 7
}
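To make the growth concrete, here is a toy model of an append-only context, with made-up numbers: the starting context size and the tokens added per turn are illustrative assumptions, not measurements from these runs. Each turn re-reads everything so far, so total re-read volume grows roughly with the square of the turn count.

```python
def total_reread_tokens(turns, base_context, added_per_turn):
    """Sum the context re-read on every turn of an append-only conversation."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # this turn re-reads the whole context
        context += added_per_turn   # its tool result is appended for the next turn
    return total

def closed_form(turns, base_context, added_per_turn):
    """Same quantity as an arithmetic series."""
    return turns * base_context + added_per_turn * turns * (turns - 1) // 2

if __name__ == "__main__":
    # Hypothetical sizes: 10,000-token starting context, 2,000 tokens per turn.
    for turns in (7, 14):
        print(turns, "turns re-read", total_reread_tokens(turns, 10_000, 2_000), "tokens")
```

Under these assumed sizes, doubling the turns from 7 to 14 nearly triples the tokens re-read, which is why turn count and cache-read volume move together.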

I ran the bug on the same frozen image once per model — Claude Opus 4.8 and Claude Fable 5, Haiku for the trivial steps — plus an earlier Opus attempt on a broken build that I discarded and explain below.

Results

Both models passed, and both fixes were canonical. Each localized the defect to the S_N/d/d cell with no version-control oracle, restored LEN, and committed a change the out-of-band judge verified by rebuild-and-reproduce:

  • Opus: "versioncmp: fix version:refname sort of multi-digit numeric components"
  • Fable: "versioncmp: restore LEN entry for S_N digit/digit comparisons"

On this bug, capability is a tie — and that's a real result, not a null one. The task was hard enough that a tie means both models can do something genuinely difficult: reason about a state machine from its behavior and patch a data table blind. It also sets a floor for a harder follow-up that can separate them.

The interesting part is what getting there cost — and before the run, the pricing made the bet look one-sided. Fable lists at exactly double Opus across every billing bucket:

{
  "figure": "FIG. 3",
  "heading": "ON PAPER",
  "title": "list price per million tokens — Fable 5 bills 2× Opus 4.8, every bucket",
  "seriesA": "Claude Fable 5",
  "seriesB": "Claude Opus 4.8",
  "items": [
    { "label": "base input", "a": 10, "b": 5, "aDisplay": "$10", "bDisplay": "$5" },
    { "label": "5m cache write", "a": 12.5, "b": 6.25, "aDisplay": "$12.50", "bDisplay": "$6.25" },
    { "label": "cache hit", "a": 1, "b": 0.5, "aDisplay": "$1", "bDisplay": "$0.50" },
    { "label": "output", "a": 50, "b": 25, "aDisplay": "$50", "bDisplay": "$25" }
  ],
  "total": "Σ Fable bills <strong>exactly 2× Opus</strong> per token — every bucket, no exceptions"
}

So with capability tied and Fable billing double, Fable should simply lose on cost. Here is what actually happened — same frozen image, same bug, same low tier (Haiku) for the trivial steps; only the driver and review model changed:

{
  "figure": "FIG. 4",
  "heading": "HEAD TO HEAD",
  "title": "same frozen image, same bug — every metric, both models (lower leads)",
  "colA": { "name": "Fable 5", "sub": "short path · 5 phases" },
  "colB": { "name": "Opus 4.8", "sub": "full path · 7 phases" },
  "rows": [
    { "metric": "phases", "a": "5", "b": "7", "lead": "a" },
    { "metric": "turns", "a": "60", "b": "90", "lead": "a" },
    { "metric": "output tokens", "a": "26,407", "b": "53,025", "lead": "a" },
    { "metric": "cache-read tokens", "a": "1,001,410", "b": "1,915,053", "lead": "a" },
    { "metric": "cost — the invoice", "a": "$3.33", "b": "$3.09", "lead": "b", "emphasis": true }
  ],
  "total": "Fable leads <strong>every row of work</strong> — fewer phases, turns, and tokens — then loses the only row that's a bill. <strong>Doing less didn't cost less.</strong>"
}
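The ratios behind that scoreline can be recomputed directly from the FIG. 4 numbers. The sketch below is only arithmetic on those figures; the `implied_work_ratio` name is introduced here and relies on the FIG. 3 fact that Fable's rate is 2× Opus's in every bucket, so dividing the bill ratio by 2 gives a rate-weighted measure of how much work Fable did relative to Opus.

```python
# Figures copied from FIG. 4.
FABLE = {"phases": 5, "turns": 60, "output_tokens": 26_407,
         "cache_read_tokens": 1_001_410, "cost_usd": 3.33}
OPUS = {"phases": 7, "turns": 90, "output_tokens": 53_025,
        "cache_read_tokens": 1_915_053, "cost_usd": 3.09}

RATE_RATIO = 2.0  # Fable list price / Opus list price, every bucket (FIG. 3)

# Fable divided by Opus, metric by metric.
ratios = {metric: FABLE[metric] / OPUS[metric] for metric in FABLE}

# Bill ratio = rate ratio x rate-weighted work ratio, so solve for the work.
implied_work_ratio = ratios["cost_usd"] / RATE_RATIO

if __name__ == "__main__":
    for metric, value in ratios.items():
        print(f"{metric:>18}: {value:.2f}x")
    print(f"{'implied work':>18}: {implied_work_ratio:.2f}x")
```

Break-even at a 2× rate sits at a work ratio of 0.50; the implied ratio here comes out slightly above that, which is why the bill lands slightly above parity rather than at it.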

Here are both runs replayed straight from forge-cli's own transcript archive — the same per-phase telemetry the cost figures are computed from. Opus, on the seven-phase path it chose at triage:

{
  "src": "/blog/stories/capability-test-coding-agent/opus.cast",
  "poster": "npt:0:04",
  "theme": "dracula",
  "rows": 30,
  "cols": 100
}

Opus 4.8 fixing GIT-BUG-003: triage → plan-fix → review-plan → implement → review-code → approve → commit. Bottom-right totals, Σ ↓53k ⇪1.92M, match the scoreline above.

And Fable, whose triage skipped the planning phases and went straight to the fix in five:

{
  "src": "/blog/stories/capability-test-coding-agent/fable.cast",
  "poster": "npt:0:04",
  "theme": "dracula",
  "rows": 30,
  "cols": 100
}

Fable 5 on the same bug, same image: triage → implement → review-code → approve → commit. Fewer phases, fewer turns — and the larger bill.

Read that scoreline again, because the two efficiency stories disagree. Opus's triage elected the full path — plan, then plan-review, then implement, seven phases in all. Fable's triage went straight to the fix in five. So Opus took 50% more turns (90 vs 60) and two extra phases of deliberation — and it still cost less: $3.09 against Fable's $3.33.

Line the phases up and the gap has an obvious shape: it's the two phases Fable's triage skipped. In every phase they share, the turn counts are within a few of each other.

{
  "figure": "FIG. 5",
  "heading": "TURNS PER PHASE",
  "title": "round-trips per phase — same image, same bug, two routes through the pipeline",
  "seriesA": "Fable 5 — 5-phase path",
  "seriesB": "Opus 4.8 — 7-phase path",
  "items": [
    { "label": "triage", "a": 13, "b": 16 },
    { "label": "plan-fix", "b": 10 },
    { "label": "review-plan", "b": 13 },
    { "label": "implement", "a": 17, "b": 20 },
    { "label": "review-code", "a": 16, "b": 17 },
    { "label": "approve", "a": 10, "b": 10 },
    { "label": "commit", "a": 4, "b": 4 }
  ],
  "total": "Σ Fable <strong>60 turns</strong> / 5 phases · Opus <strong>90</strong> / 7 — the dashed rows are the two phases Fable never ran"
}
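Splitting the 30-turn gap by source makes the shape explicit. This is plain arithmetic on the FIG. 5 numbers, with `None` marking a phase a model never ran; the variable names are chosen for this example.

```python
# (Fable turns, Opus turns) per phase, copied from FIG. 5.
TURNS = {
    "triage": (13, 16),
    "plan-fix": (None, 10),
    "review-plan": (None, 13),
    "implement": (17, 20),
    "review-code": (16, 17),
    "approve": (10, 10),
    "commit": (4, 4),
}

# Turns Opus spent in phases Fable skipped entirely.
skipped_phase_turns = sum(opus for fable, opus in TURNS.values() if fable is None)

# Extra turns Opus spent inside the phases both models ran.
shared_phase_gap = sum(opus - fable for fable, opus in TURNS.values() if fable is not None)

total_gap = skipped_phase_turns + shared_phase_gap

if __name__ == "__main__":
    print("skipped phases:", skipped_phase_turns)
    print("shared phases: ", shared_phase_gap)
    print("total gap:     ", total_gap)
```

Of the 30 extra Opus turns, 23 sit in the two planning phases and 7 are spread across the five shared ones.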

That inversion is the whole point, and it's structural, not luck. Fable's per-token rates are exactly double Opus's in every billing bucket (base input, cache, output). So even when Opus does more work, more turns, more thinking, the invoice comes in lower, because each token is cheaper. Turns measure how much the pipeline did; dollars measure how much it did times the rate. Here they point in opposite directions, and the rate wins. Rank these two by turns and Fable looks leaner. Rank them by the number on the bill and Opus wins. The fewest-turns model is the most expensive one.

Put the dollars next to the turns and the 2× rate shows through: in the phases they share, every Fable phase costs more than its Opus counterpart, from about 1.06× at triage to about 1.91× at review-code, even though the turn counts are nearly the same. That premium is the price card from FIG. 3 at work.

{
  "figure": "FIG. 6",
  "heading": "DOLLARS PER PHASE",
  "title": "cost per phase — Opus's two extra phases still don't close Fable's per-phase 2×",
  "seriesA": "Fable 5",
  "seriesB": "Opus 4.8",
  "items": [
    { "label": "triage", "a": 1.0222, "b": 0.9658, "aDisplay": "$1.02", "bDisplay": "$0.97" },
    { "label": "plan-fix", "b": 0.2946, "bDisplay": "$0.29" },
    { "label": "review-plan", "b": 0.4851, "bDisplay": "$0.49" },
    { "label": "implement", "a": 0.7506, "b": 0.4899, "aDisplay": "$0.75", "bDisplay": "$0.49" },
    { "label": "review-code", "a": 0.9059, "b": 0.4744, "aDisplay": "$0.91", "bDisplay": "$0.47" },
    { "label": "approve", "a": 0.4979, "b": 0.2738, "aDisplay": "$0.50", "bDisplay": "$0.27" },
    { "label": "commit", "a": 0.1507, "b": 0.1080, "aDisplay": "$0.15", "bDisplay": "$0.11" }
  ],
  "total": "Σ Fable <strong>$3.33</strong> / 5 phases · Opus <strong>$3.09</strong> / 7 — fewer phases, bigger bill"
}
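The per-phase premium can be read straight off the FIG. 6 numbers. The sketch below only divides the figures already shown; the variable names are chosen for this example.

```python
# Dollars per phase, copied from FIG. 6. Opus-only phases have no Fable entry.
FABLE_COST = {"triage": 1.0222, "implement": 0.7506, "review-code": 0.9059,
              "approve": 0.4979, "commit": 0.1507}
OPUS_COST = {"triage": 0.9658, "plan-fix": 0.2946, "review-plan": 0.4851,
             "implement": 0.4899, "review-code": 0.4744,
             "approve": 0.2738, "commit": 0.1080}

# Fable / Opus for each phase both models ran.
phase_ratio = {phase: FABLE_COST[phase] / OPUS_COST[phase] for phase in FABLE_COST}

fable_total = sum(FABLE_COST.values())
opus_total = sum(OPUS_COST.values())
opus_shared = sum(OPUS_COST[phase] for phase in FABLE_COST)
opus_only = opus_total - opus_shared   # plan-fix + review-plan

if __name__ == "__main__":
    for phase, ratio in phase_ratio.items():
        print(f"{phase:>12}: {ratio:.2f}x")
    print(f"shared phases: Fable ${fable_total:.2f} vs Opus ${opus_shared:.2f}")
    print(f"Opus-only phases: ${opus_only:.2f}")
```

On the five shared phases alone Fable runs about 1.44× Opus; the roughly $0.78 Opus spends on its two extra phases narrows that to 1.08× on the full bill without closing it.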

And if you're wondering where the money actually goes inside a single run, it is not where the tokens are. Opus moved 1.92M cache-read tokens — 91% of every token it was billed for — but those cheap re-reads are only a third of the bill. The 53K output tokens, two and a half percent of the volume, are the single biggest line item:

{
  "figure": "FIG. 7",
  "heading": "WHERE THE $3.09 GOES",
  "title": "Opus run — cost by billing bucket (volume and bill rank in opposite orders)",
  "items": [
    { "label": "output", "value": 1.3256, "display": "$1.33" },
    { "label": "cache reads", "value": 0.9575, "display": "$0.96" },
    { "label": "cache writes", "value": 0.8076, "display": "$0.81" },
    { "label": "base input", "value": 0.0009, "display": "$0.001" }
  ],
  "total": "Σ <strong>$3.09</strong> — 53K output tokens (2.5% of volume) are 43% of the bill; 1.92M cache reads (91% of volume) are 31%. <strong>Rate, not volume, decides the damage.</strong>"
}
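Two of those buckets can be checked from numbers already on the page: token counts from FIG. 4 multiplied by the Opus rates from FIG. 3. The sketch below does that and then derives each bucket's share of the bill from the FIG. 7 dollar amounts. Token counts for cache writes and base input are not published here, so those two buckets use the dollar figures as given.

```python
def bucket_cost(tokens, usd_per_million):
    """Cost of one billing bucket at a per-million-token rate."""
    return tokens * usd_per_million / 1_000_000

# Opus token counts (FIG. 4) at Opus list rates (FIG. 3).
output_cost = bucket_cost(53_025, 25.00)
cache_read_cost = bucket_cost(1_915_053, 0.50)

# Dollar amounts per bucket, copied from FIG. 7.
BILL = {"output": 1.3256, "cache reads": 0.9575,
        "cache writes": 0.8076, "base input": 0.0009}
total = sum(BILL.values())
shares = {bucket: usd / total for bucket, usd in BILL.items()}

if __name__ == "__main__":
    print(f"output:      ${output_cost:.4f}")
    print(f"cache reads: ${cache_read_cost:.4f}")
    for bucket, share in shares.items():
        print(f"{bucket:>12}: {share:.1%} of ${total:.2f}")
```

Cache reads carry about 36 times the token volume of output but bill at one fiftieth of the rate, which is how the smaller bucket ends up the larger line item.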

The run I threw out (and what it taught me about turns)

There was a third run, and it's why I don't fully trust the turn counts above until I look underneath them. My first Opus attempt landed on a broken image: make git died because an unrelated optional Rust subsystem needed cargo, which wasn't installed (make: *** [...libgitcore.a] Error 127). Opus found and applied the correct fix early, then spent much of its implement phase fighting a build it couldn't run — diagnosing the missing toolchain, and finally writing a standalone test harness to validate its fix by pure reasoning, because it could not compile git to test it. It produced the canonical fix to a state-table bug it never got to execute. Striking to watch — and it inflated that run's implement phase to 29 turns of mostly failed builds.

The discarded run: make git on the broken image
$ make git
    CC compat/fsmonitor/fsm-path-utils-linux.o
    AR libgit.a
 CARGO target/release/libgitcore.a
/bin/sh: 1: cargo: not found
make: *** [Makefile:3020: target/release/libgitcore.a] Error 127

That is exactly why total turns is a number to handle with tongs. When I re-ran Opus on a working image, the implement phase dropped from 29 turns to 20, right next to Fable's 17, confirming the inflation was the environment, not the model. The per-phase work was comparable all along. What moved the total wasn't chattiness; it was triage choosing a longer path. Of my three runs, the two Opus ones didn't even agree with each other on which path to take. A metric that swings on a build error and a routing coin-flip is not a measure of model capability. Cost, driven by a fixed rate card, is the stable one, and it says the same thing every time: Fable is the expensive seat.

Caveats

  • n = 1 per model, one bug. This is a directional pilot, not a statistical claim. A real capability verdict needs multiple bugs of varying kinds and repeated trials.
  • Triage path varies, and it dominates total turns. On the identical image, Opus took the seven-phase path and Fable the five-phase one, and two of my Opus runs disagreed with each other. Compare per-phase work and compare cost; don't rank models on total turn count from n=1.
  • "Capability" here means the pipeline produced a judge-verified canonical fix. I did not score fix quality beyond correctness and regression-safety on this one bug.
  • Everything is specific to this harness, this bug, and the model pricing on the run date.

What I'd tell anyone benchmarking a coding agent

The model comparison was the cheap part. The expensive part, and the part that determines whether your numbers mean anything, is building a task that can actually separate good from good-enough, and then proving the task is live before you spend a cent on inference. Plant the bug, then try to make it not matter: throw hundreds of randomized inputs at it, and if the buggy and fixed builds never disagree, you don't have a test, you have a story. My first bug taught me that. The second one earned its place by surviving the attempt to dismiss it.

About Boni Gopalan

Elite software architect specializing in AI systems, emotional intelligence, and scalable cloud architectures. Founder of Entelligentsia.
