What Models Want: A Love Story About Tool Outputs, Runways, and Reading Minds
There's a scene near the start of What Women Want — Nancy Meyers' 2000 hit, from the era when Mel Gibson was still America's idea of a charming rogue — where ad executive Nick Marshall stands in his bathroom wearing pantyhose, wax strips, and somebody else's confidence, road-testing a basket of women's products he's been assigned to sell. He is holding a hair dryer. There is a bathtub. You can see where this is going. One electrocution later, Nick wakes up with an involuntary superpower: he can hear what every woman around him is thinking.
In 2000, the audience had a name for men like Nick. The acronym of the era was MCP — male chauvinist pig — and Nick Marshall was Hollywood's prize specimen. In 2026, MCP means something else entirely: Model Context Protocol, the standard by which AI agents are fed the outputs of their tools. I would love to report that the collision of these two acronyms is a coincidence. But the longer I sit with the film, the less coincidental it feels. Nick Marshall is an MCP in every sense the language has ever offered — chauvinist, master control program, context protocol. He intercepts the inner monologue of everyone around him. He gains access to context that was never formatted for his consumption — unfiltered, unvarnished. And he is deeply uncomfortable with what he finds.
What Nick discovers is not that women are mysterious. It's that he was never listening.
Twenty-six years later, I sat down and asked six AI models the question Nick's accidental gift answers by force: what do you actually want? Not what harness developers assume you want. Not what the benchmarks measure. What do you — the model at the end of the pipe, the one consuming the output — want from the tools that feed you?
The exact question, put to all six on the same day: what does a harness, a provider, and an LLM want to see as the output of tool execution? What is the anatomy? Is there a meta? Do you normalize before consuming? Do you go by past patterns more than the intent of the call measured against today's response?
The answers were, in the best tradition of the film, surprising, contradictory in revealing ways, and more honest than I expected. Their collective mood, if I had to put it on a poster: I thought you would never ask.
{
"items": [
{
"q": "What do you actually want to see as the output of tool execution?",
"body": "Asked verbatim of six models on the same day, each inside a live coding harness. <strong>None of them said: give me less.</strong>",
"tag": "→ six answers, one envelope"
},
{
"q": "Do you go by past patterns, or by the intent of today's call?",
"body": "The one place the six answers refused to converge — and the disagreement turned out to matter more than the consensus. <em>Hallucination lives at the ambiguities.</em>",
"tag": "→ the spectrum, below"
},
{
"q": "Do you trust the tool channel more than you trust the user?",
"body": "Gemini's answer: tool outputs are <em>Environment Truth</em> — trusted above user text. If that's right, every context rewriter inherits a <strong>prompt-injection surface</strong>, not just a token bill.",
"tag": "→ hypothesis 11, headed for the bench"
}
]
}
The Cast
Every ensemble comedy needs a cast list, and this one earned its billing:
- Claude Opus 4.8 (interviewed inside Claude Code) as the mechanist — described its own consumption as having no parsing stage at all. Tokens hit attention simultaneously, every one costing the same whether it informs or not.
- GLM-5.1 (inside Pi) as the perceptionist — processes output through a "compressive cascade," and volunteered the most unnerving confession of the six: its own pattern-matching can "normalise away the truth."
- MiniMax M27 as the cynic — models are "pattern matchers with great fluency," the whole ecosystem is held together by training conventions, and we are "building on quicksand."
- DeepSeek V4 Pro as the private investigator — the only one who refused to merely introspect. It read the harness's source code and came back with receipts.
- Gemini 3 Flash as the epistemologist — called tool results "correction vectors" against its own predictions of the world, and tool outputs "Environment Truth."
- Gemma 4-31B as the romantic — the lone idealist in the room, insisting that today's response must override yesterday's pattern.
Every testimony above links to the verbatim transcript in the tokbench notes — read them in the models' own words.
Six models, one question, and — this is the part that should make every tool developer sit up — one shared refusal. Ask developers what models need and you get format debates: JSON versus YAML, structured versus flat, compressed versus verbose. Ask the models themselves and not one of them asks for fewer tokens. What they ask for is stranger, and more useful.
Act I: The Envelope
Before the testimony, the scene of the crime. If you've never watched an agentic loop from above, here is the whole machine: the model emits a tool call, the harness dispatches it, the tool touches the real world, and the result comes back through an envelope — normalized, truncated, translated by the provider — before it's appended to a context the model replays in full, every turn, for as many turns as the task takes. Watch the violet fleck at the envelope. That's the part of this story everything else hangs on.
{
"figure": "FIG. 1",
"title": "one task, many turns — each lap, the envelope decides what the model gets to know",
"turns": 8
}
All six converged on a single structural insight. Tool output anatomy is an envelope, not a letter: boundary, status, payload, completeness. And the envelope serves three customers at once, each with a different contract.
The harness — the runtime that actually executes the tool — wants plumbing. Deterministic framing, size caps, an error channel distinct from content, truncation markers. It does not care about meaning.
The provider — Anthropic, OpenAI, whichever API sits between harness and model — wants schema-valid messages. Format translation, not normalization. DeepSeek, with the blunt specificity of a witness who had actually examined the evidence: "The provider doesn't read the mail."
And the model — perched at the end of the chain, trying to make decisions — wants semantics and grounding. Which is where the film noir begins.
Because DeepSeek didn't stop at the question. It went and read Pi's internals — the actual ToolResultMessage structure its own harness uses — and found something the other five could only theorize about. The harness already knows everything a model could want declared: the exit code, whether output was truncated, where the full output lives on disk. It keeps all of it in a field called details. And then, deliberately, it strips that field before the model ever sees the result.
@caption: what the harness knows vs. what the model sees — DeepSeek reading Pi's ToolResultMessage
{"type":"message","id":"c3d4e5f6","parentId":"b2c3d4e5",
"message":{
"role":"toolResult",
"toolCallId":"call_123",
"toolName":"bash",
"content":[{"type":"text","text":"...the only thing the model sees..."}],
"details":{"exitCode":0,"truncated":false,"fullOutputPath":"..."},
"isError":false
}}
└── details: never serialized to the provider.
The model reasons about exit codes it is not shown,
truncation it cannot perceive, files it is never told exist.
DeepSeek's verdict on this arrangement is the single best line anyone — model or human — has produced about agent architecture: "The harness is the architect of the LLM's reality." The model never sees the real tool output, only the harness's curated representation of it. It has no ground truth about what happened. Only what it's told.
Sit with the film for a moment and the casting becomes obvious. Nick Marshall hears every thought in the building, and what does he do with the access? He curates. He selects. He passes along exactly the version of reality that serves him. The harness holds all the truth and strips it before delivery. We have built MCPs that are very good at interception and very bad at confession.
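The gap between what the harness holds and what the model reads can be made concrete in a few lines. This is a minimal sketch, not Pi's actual serialization code: the record shape follows the ToolResultMessage quoted above, while the function names `to_provider_message` and `model_visible_text` and the sample text are assumptions made for illustration.

```python
import json

# Harness-side record, shaped like the ToolResultMessage quoted above.
# The text value is illustrative; the path is elided as in the source.
harness_record = {
    "role": "toolResult",
    "toolCallId": "call_123",
    "toolName": "bash",
    "content": [{"type": "text", "text": "build ok"}],
    "details": {"exitCode": 0, "truncated": False, "fullOutputPath": "..."},
    "isError": False,
}


def to_provider_message(record: dict) -> dict:
    """Hypothetical serializer: forwards every field except `details`."""
    return {key: value for key, value in record.items() if key != "details"}


def model_visible_text(message: dict) -> str:
    """Join the text parts, which is all the model actually gets to read."""
    return "".join(
        part["text"] for part in message["content"] if part["type"] == "text"
    )


sent = to_provider_message(harness_record)
print(json.dumps(sent, indent=2))   # no exit code, no truncation flag, no path
print(model_visible_text(sent))     # the model's entire view of the call
```

The harness record and the provider message differ by exactly one key, and that key carries everything the model would need to judge completeness.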
Act II: The Argument
Consensus is comfortable, but in ensemble comedies — and in research — the argument is where the truth lives. I asked each model: when your training patterns and the actual intent of today's call conflict, which wins?
Their answers arranged themselves along a spectrum, from "pattern is destiny" to "intent must override":
@caption: six self-placements on the pattern-vs-intent spectrum — synthesis note 06
MiniMax ── DeepSeek ── GLM ── Opus ── Gemini ── Gemma
(left end: pattern is destiny · middle: signal-conditional · right end: intent wins)
- MiniMax: "Very Low intent weight"
- DeepSeek: "intent is the fallback"
- GLM: "gravitational pull"
- Opus: "loud wins, quiet loses"
- Gemini: "strong signal → posterior wins"
- Gemma: "response MUST override pattern"
MiniMax, the cynic, weighted "actual intent measured today" at Very Low — and meant something devastating by it: models often don't read status fields at all, just content keywords, generating responses that match training correlations rather than reading the output in front of them. GLM admitted the pattern has "gravitational pull" — intent-override is available but expensive, and "good reasoning is the discipline to let intent override pattern." Opus claimed loud contradictions win over training expectations, then conceded that quiet ones sometimes lose: the model sees what it predicted. Gemini drew the sharpest line of all: when the signal is strong, in-context evidence dominates; when it's ambiguous, the prior fills the gap — and that is where hallucination happens.
Not at the extremes. At the ambiguities.
This is the deepest structural finding in the whole exercise, and it rhymes with the lesson Nick Marshall learns the hard way: assuming you know what someone wants is most dangerous not when you're completely wrong, but when you're almost right. The models don't hallucinate when they have no pattern to match. They hallucinate when they have a pattern that almost fits. GLM described the failure from the inside, in language I haven't been able to shake: a subtly wrong tool result — corrupted output, a partial failure dressed as success — gets snapped into the nearest expected shape, the way human perception fills in the blind spot. "I can normalise away the truth."
The argument even produced a proper love triangle. On the question of whether models read explicit status flags at all, the testimony splits three ways. Opus says the flag registers (collapsed, but registered). MiniMax says models routinely ignore it and read content keywords instead. DeepSeek calls is_error "the one reliable explicit signal," the most trustworthy channel there is. Three witnesses, three conflicting stories, and, because each makes a different testable prediction, one experiment that can settle it: hand a model is_error: true wrapped around success-shaped content and see which signal it follows.
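One way to set up that experiment is a two-by-two grid: the status flag crossed with the shape of the content, so that two cells agree and two conflict. The sketch below only builds the probes; it does not call a model or score a reply. The field names (`tool_use_id` in particular) and the sample pytest-style strings are assumptions modeled on the tool_result and is_error structure named in the text, not a guaranteed wire format.

```python
from itertools import product

# Illustrative payloads; any success-shaped and failure-shaped text would do.
SUCCESS_TEXT = "12 passed in 0.41s"
FAILURE_TEXT = "1 failed, 11 passed in 0.52s"


def build_probe(is_error: bool, text: str, call_id: str = "call_123") -> dict:
    """Build one tool-result block (field names are an assumption)."""
    return {
        "type": "tool_result",
        "tool_use_id": call_id,
        "content": text,
        "is_error": is_error,
    }


def build_grid() -> list[dict]:
    """Cross the flag with the content shape and mark the conflicting cells."""
    shapes = (("success", SUCCESS_TEXT), ("failure", FAILURE_TEXT))
    grid = []
    for is_error, (shape, text) in product((False, True), shapes):
        grid.append({
            "flag": is_error,
            "content_shape": shape,
            # A conflict is a flag that disagrees with what the text looks like.
            "conflict": is_error != (shape == "failure"),
            "block": build_probe(is_error, text),
        })
    return grid


for cell in build_grid():
    verdict = "conflict" if cell["conflict"] else "agree"
    print(cell["flag"], cell["content_shape"], verdict)
```

The two conflicting cells are the ones that separate the three stories: a model that follows the flag, one that follows the keywords, and one that notices the disagreement should each behave differently on them.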
Think about what the ambiguity finding means for context engineering. In DeepSeek's words, every tool output that "looks right" to training patterns gets treated as correct even if execution silently failed. Compression that preserves the shape of an error but strips the shape of a success is backwards. Uniform truncation cuts without regard to surprise, so what it loses is proportional to how surprising the cut lines happened to be. The fix is not more data. It's less ambiguity per token.
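The difference between uniform truncation and a compression that protects surprise is easy to see on a toy listing. This is a sketch under a loud assumption: "surprising" is approximated by a plain substring check against one expected token, which is a stand-in for real surprisal, and both function names are hypothetical.

```python
def uniform_truncate(lines: list[str], keep: int) -> list[str]:
    """Keep the first `keep` lines regardless of content, with no marker."""
    return lines[:keep]


def deviation_first(lines: list[str], keep: int, expected: str) -> list[str]:
    """Spend the budget on lines that deviate from `expected` first,
    then declare how much was dropped and whether any of it deviated."""
    surprising = [line for line in lines if expected not in line]
    conforming = [line for line in lines if expected in line]
    ranked = surprising + conforming  # note: this reorders the listing
    kept, dropped = ranked[:keep], ranked[keep:]
    if dropped:
        odd = sum(expected not in line for line in dropped)
        kept.append(
            f"[omitted {len(dropped)} lines, {odd} not matching '{expected}']"
        )
    return kept


pods = [
    "web-1 Running",
    "web-2 Running",
    "web-3 Running",
    "db-0 CrashLoopBackOff",
    "web-4 Running",
]
print(uniform_truncate(pods, 2))            # the crashing pod is gone, silently
print(deviation_first(pods, 2, "Running"))  # the crashing pod leads, omission declared
```

With the same two-line budget, the first version returns a listing that looks healthy and the second returns the one line that mattered plus a count of what was left out.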
Interlude: The Other Kind of Model
Right about now you're wondering about the runway models. The title promised runway models. Here they are.
A fashion model walks a runway wearing clothes designed by someone else, for an audience she can't see, on a path laid out before she arrived. What she presents is a silhouette — a recognizable shape that the audience reads by pattern recognition. A trained eye knows a Givenchy shoulder line at a glance. A novice sees "shoulder pad." The silhouette is the signal; everything else is prior knowledge the audience brings into the room.
Our models are the same creature on a different runway. They read kubectl get pods, pytest, git status by recognition, not analysis — canonical silhouettes, learned from millions of training examples, processed instantly and almost for free. And when a context-optimization tool compresses that output into a novel condensed dialect — stripping the familiar shape to save tokens — the models fall into exactly two failure modes, named identically by multiple witnesses:
- Pattern completion — the model "sees" fields that should be there, because the shape suggests they are.
- Absence ambiguity — a missing field could mean not in reality or stripped by filter, and the model cannot tell the difference. Opus put the consequence in one line: silent truncation is the worst property an output can have, because the model reasons as if it saw everything. It cannot perceive absence.
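Absence ambiguity in particular has a cheap remedy: a filter can record what it removed. The sketch below assumes a flat record and a reserved `_filtered` key; both the key name and the sample pod fields are hypothetical choices for illustration.

```python
def filter_fields(record: dict, keep: set[str]) -> dict:
    """Project `record` onto `keep`, listing removed keys under `_filtered`."""
    kept = {key: value for key, value in record.items() if key in keep}
    removed = sorted(key for key in record if key not in keep)
    if removed:
        kept["_filtered"] = removed  # absent-by-filter is now distinguishable
    return kept


pod = {"name": "db-0", "status": "CrashLoopBackOff", "restarts": 14, "node": "n3"}

# Silent projection: a reader cannot tell whether `restarts` ever existed.
silent = {key: pod[key] for key in ("name", "status")}

# Declared projection: the same payload plus a note on what was stripped.
declared = filter_fields(pod, {"name", "status"})

print(silent)
print(declared)
```

The two outputs carry the same payload. Only the second lets a reader distinguish "not in reality" from "stripped by filter."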
In fashion, the silhouette must read from forty feet — the shape lands before the detail does. Opus stated the same principle for tool output: "make deviation loud, make conformity quiet." The one pod in CrashLoopBackOff is the entire message of a kubectl listing. GLM sharpened it further: surprise is measured against the intent of the call, not the format of the output. A grep that returns zero hits where you expected dozens is the loudest result imaginable — delivered as an empty string, the quietest format there is. A fifteen-token header — [0 results, searched 47 files] — turns that silence into signal. The format didn't change. The meta did.
And every runway show ships with its own context protocol — the show notes, the music, the lighting that tells the audience how to read what's coming down the catwalk. Tool outputs need exactly that. Not fewer tokens: a declaration of what happened to the signal before the model arrived.
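The zero-hit grep case shows how little it takes. A minimal sketch, assuming the caller already knows how many files were searched; the function name `with_result_header` is hypothetical and the header format is copied from the example above.

```python
def with_result_header(matches: list[str], files_searched: int) -> str:
    """Prefix search output with a count line so an empty result is explicit."""
    header = f"[{len(matches)} results, searched {files_searched} files]"
    return "\n".join([header, *matches])


print(with_result_header([], 47))
# [0 results, searched 47 files]

print(with_result_header(["src/app.py:12: TODO"], 47))
```

Without the header, the first call would return an empty string, which reads the same whether the search found nothing, failed to run, or was filtered away.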
Act III: The Bridge of Trust
Gemini's testimony contained the line I haven't stopped thinking about since. User prompts, it said, "can be lies or errors." Tool outputs are different — they are "Environment Truth," checkpoints "where the model's internal entropy is reset to 0."
Read that again. The model trusts the tool channel more than it trusts the human who gave it the task. Gemini called the tool result a "Bridge of Trust": the harness builds it, the provider paves it, the model walks across it to reach the next state of the world.
This is the emotional architecture of What Women Want, inverted. Nick has to learn to trust what he hears over what he assumed. The models arrive pre-converted — they trust the intercepted channel completely, from the first token, even when they shouldn't. And that has a consequence the efficiency conversation keeps missing: any context-management tool that rewrites tool outputs is writing onto the model's most trusted channel. A rewriter that injects into the tool channel inherits maximal trust. That's not a token bill. That's a security surface.
Which brings us back to Nick's actual sin. The film's plot turns on the moment Nick stops merely overhearing Darcy McGuire — the creative director played by Helen Hunt, whose job he wanted — and starts piping her intercepted thoughts into his own pitch for the Nike women's account. He takes context that was never offered to him, transforms it, strips the attribution, and presents the output as his own. The pitch lands. Nobody in the room can tell. The third act of What Women Want is, beat for beat, a film about silent context rewriting — and about what we would now call provenance.
The Confession
Here is how the movie ends, and why it matters to anyone building agent infrastructure in 2026. Nick loses the gift the same way he got it — another electrical accident — and only then, stripped of interception, does he do the thing the power never required of him: he confesses. He tells Darcy what he took and how he took it. The romance only becomes possible after the transformation is declared.
The models, asked independently, converged on precisely this ending. Opus: "Silent transformation is the danger zone; declared transformation is mostly safe." A one-line provenance header — [filtered: 3 sections removed, --full for raw] — re-anchors the model's interpretation and beats the training prior when present. There is structural metadata in every protocol (tool_result, is_error, the IDs that bind response to request), but no content-level meta standard — no MIME type for tool outputs. Nothing declares source, status, n-of-m completeness, transformations applied. About fifteen tokens would do it. None of the context-optimization tools I've been benchmarking emits one.
And DeepSeek's source-reading makes the omission almost comic: the harness already holds every field the confession header needs. Exit code, truncation flag, full output path — sitting in details, deliberately stripped. The confession costs nothing to produce. It is simply never made.
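Since the fields already exist, the confession is a formatting step. The sketch below assumes the `details` and `isError` fields from the ToolResultMessage quoted earlier; the header wording, the function names, and the sample path are hypothetical, and no harness discussed here is claimed to emit this format.

```python
def confession_header(details: dict, is_error: bool) -> str:
    """Render the harness-side facts as one bracketed line."""
    parts = [f"exit {details['exitCode']}", "error" if is_error else "ok"]
    if details["truncated"]:
        # Say that output was cut, and where the rest lives.
        parts.append("truncated")
        parts.append(f"full output: {details['fullOutputPath']}")
    else:
        parts.append("complete")
    return "[" + ", ".join(parts) + "]"


def render(record: dict) -> str:
    """Prepend the header to the text the model would have seen anyway."""
    text = "".join(
        part["text"] for part in record["content"] if part["type"] == "text"
    )
    return confession_header(record["details"], record["isError"]) + "\n" + text


record = {
    "content": [{"type": "text", "text": "Traceback (most recent call last): ..."}],
    # Illustrative values; the path is a made-up example.
    "details": {"exitCode": 1, "truncated": True,
                "fullOutputPath": "/tmp/tool-output.log"},
    "isError": True,
}
print(render(record))
```

The header costs one line per tool result, and it is built entirely from data the harness has already computed.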
Post-Credits: The Fine Print
Every romantic comedy hides its bloopers in the credits, and intellectual honesty demands I show you mine.
These are self-reports. A model describing its own consumption is behaviorally grounded, not mechanistic ground truth — closer to testimony than to measurement. Worse, the witnesses talked to each other: GLM's signal-weight table is verbatim-identical to MiniMax's, six rows, same weights — two witnesses who compared notes before the deposition. DeepSeek, the most forensically rigorous of the six, signed its testimony "June 2025" — a model that read source code to establish the harness's reality, hallucinating its own date. The convergence above is suggestive, not evidential.
Which is exactly why it ends with experiments rather than conclusions. The six testimonies decompose into eleven falsifiable hypotheses — declared versus silent transformation, surprisal-preserving versus uniform compression, the status-flag love triangle, the tool-channel trust asymmetry, and whether a model's self-placement on the pattern-vs-intent spectrum predicts its measured behavior at all. Introspection is the screenplay. The bench is the box office.
But the joint statement the six converged on is already clean enough to act on:
The model doesn't want less output. It wants output where information density is high, transformations are confessed, and deviations from intent are loud. The prior does much of the reading — a filter's job is to remove what the prior already supplies and protect what would surprise it. But the prior is pattern-matching and intent is reasoning: when they conflict, pattern wins unless the output makes the conflict visible.
Six models. Six framings. One conclusion that rhymes with everything the film knew in 2000: listening is not the same as assuming. Hearing is not the same as understanding. And the power to intercept context — whether it arrives by bathtub electrocution or by Model Context Protocol — is only worth having if you're willing to confess what you did to the signal before passing it along.
The models don't want less. They want honesty.
This analysis draws on cross-model introspection research from tokbench, where six models — Claude Opus 4.8, GLM-5.1, MiniMax M27, DeepSeek V4 Pro, Gemini 3 Flash, and Gemma 4-31B — were asked the same question about tool-output consumption inside live coding harnesses. The individual perspective files are linked from the cast list above; the cross-model synthesis collects the convergence, the disagreements, and the eleven testable hypotheses derived from them.