Pi-Ralph Is Smaller Than It Looks and Bigger Than It Acts
There is a peculiar honesty to a 900-line agent. It cannot pretend to be more than it is, and the discipline of staying small forces the architecture to reveal itself plainly. Entelligentsia's pi-ralph — a pi extension that runs a Generator → Critique → Judge loop against any user-supplied goal — fits in roughly that footprint. It is, by any reasonable measure, a toy. It is also, on closer reading, a quietly subversive demonstration of what a minimal agentic scaffold can become once you take its host harness seriously.
The temptation with a project this compact is to admire its restraint and move on. That would be a mistake. The interesting questions about pi-ralph are not in what it does but in what it affords — the production-grade behaviors that are absent from the code today and one config change, or one creative prompt, away from being present tomorrow.
A complete run, on one page
Before any architecture, watch a real run. The transcript below is the verbatim output of pi-ralph chasing a deliberately hard goal: a one-sentence story under ten words that out-Hemingways Hemingway. The first pane is the setup the scaffold builds before the loop ever spins: feasibility verdict, identified domain, and the three system prompts the meta-prompt step writes for the Generator, Critique, and Judge. Every word of those prompts is constructed by the meta-prompt LLM call from the goal alone — pi-ralph itself ships no domain knowledge.
Now the loop itself. Three iterations of Generator → Critique → Judge, each message streamed to chat the moment it lands. Watch what happens between iterations two and three — the win is a single-word swap.
Nothing in either pane is illustrative-only. Both are drawn from the project's run-log.md, which preserves the full transcript. The visible feedback loop is the product — and watching "empty" become "painted" between iterations two and three shows the critic doing real work, not pattern-matching applause.
What the harness hands you for free
Strip pi-ralph back to its skeleton and you find four moves:
- A feasibility check that asks an LLM whether the user's goal can be meaningfully pursued in text.
- A meta-prompt that takes the goal and generates three domain-tailored system prompts — one each for a Generator, a Critique, and a Judge.
- A bounded loop (default three iterations) that runs the triad, feeding each round's result and criticism back into the next Generator call.
- A verdict, surfaced as the loop's last word.
That is the whole architecture:
flowchart TD
Goal["User goal"] --> Feas{"Feasibility check"}
Feas -->|"not achievable"| Stop["Refuse and explain"]
Feas -->|"achievable"| Meta["Meta-prompt generates<br/>domain-tailored prompts"]
Meta --> Loop["Loop start"]
Loop --> Gen["Generator produces result"]
Gen --> Crit["Critique identifies gaps"]
Crit --> Judge{"Judge done?"}
Judge -->|"yes"| Verdict["Final verdict and output"]
Judge -->|"no, under max"| Feedback["Feed result and criticism<br/>back to Generator"]
Feedback --> Gen
Judge -->|"no, at max"| Verdict
Roughly eleven LLM calls for the default configuration. Each intermediate step is posted to chat as it arrives, so the loop doubles as its own audit trail. There is nothing clever here, and that is the point.
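The four moves and the call budget can be sketched in a few lines. This is an illustration of the control flow, not pi-ralph's actual code: `runLoop`, the `Llm` callback type, the `"feasibility"` and `"meta-prompt"` placeholder prompts, and the assumption that the feasibility reply starts with "yes" are all hypothetical. The callback is synchronous for brevity, where a real model call would be asynchronous.

```typescript
// Hypothetical sketch of a Generator -> Critique -> Judge loop.
// None of these names come from pi-ralph; the model call is injected as a
// plain function so the control flow can be read in isolation.
type Llm = (system: string, user: string) => string;

interface Prompts { generator: string; critique: string; judge: string }
interface LoopResult { feasible: boolean; result: string; verdict: string; calls: number }

function runLoop(goal: string, llm: Llm, maxIterations = 3): LoopResult {
  let calls = 0;
  const ask = (system: string, user: string): string => {
    calls += 1;
    return llm(system, user);
  };

  // Move 1: feasibility check (assumed to answer "yes" or "no: <why>").
  const feasibility = ask("feasibility", goal);
  if (feasibility.toLowerCase().indexOf("yes") !== 0) {
    return { feasible: false, result: "", verdict: feasibility, calls };
  }

  // Move 2: one meta-prompt call writes all three system prompts (assumed JSON).
  const prompts = JSON.parse(ask("meta-prompt", goal)) as Prompts;

  // Move 3: bounded loop, feeding result and criticism back to the Generator.
  let result = "";
  let criticism = "";
  let verdict = "max iterations reached";
  for (let i = 0; i < maxIterations; i++) {
    const context = `Goal: ${goal}\nPrevious result: ${result}\nCriticism: ${criticism}`;
    result = ask(prompts.generator, context);
    criticism = ask(prompts.critique, `Goal: ${goal}\nResult: ${result}`);
    const judged = JSON.parse(ask(prompts.judge, criticism)) as { done: boolean; reason: string };
    if (judged.done) {
      verdict = judged.reason;
      break;
    }
  }

  // Move 4: the verdict is the loop's last word.
  return { feasible: true, result, verdict, calls };
}
```

Under these assumptions, a Judge that never says done costs 1 + 1 + 3 × 3 = 11 calls with the default cap of three iterations, and a Judge that approves the first draft stops after 5.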
Installable from either registry:
- npm: @entelligentsia/pi-ralph
- pi.dev: @entelligentsia/pi-ralph
The whole thing is small enough to read in one sitting. The source map:
- index.ts: extension entry point, registers the /ralph-loop-anything command
- agents/orchestrator.ts: coordinates feasibility, meta-prompt, loop, verdict
- agents/generator.ts: produces or improves a result
- agents/critique.ts: evaluates a result against the goal
- agents/judge.ts: returns {done, reason} JSON
- prompts.ts: the feasibility prompt, the meta-prompt, and fallback static prompts
- llm.ts: the pi-ai complete() wrapper and JSON-response parser
What makes the brevity possible is everything pi-ralph doesn't implement. The pi harness — Earendil's minimal coding-agent framework — supplies the model registry, the API-key plumbing, the streaming infrastructure, the abort signals, the message renderer, the command registration. Pi-ralph is small because all of the genuinely tedious work has already been done by the platform it extends. The extension's job is to compose primitives, not to build them.
This split matters more than it first appears. The dominant story in agentic-system tutorials is that the orchestration code is the interesting part. Pi-ralph quietly inverts that: the orchestration is a few hundred lines of glue, and the interesting part is the surface area the harness exposes. Two pieces of that surface, in particular, change what the scaffold can become.
Latent property one: every subagent can pick its own model
In pi-ralph as written, the Generator, Critique, and Judge all share whatever model the user selected with /model. This is a default, not a constraint. The pi runtime maintains a registry of fifteen-plus providers and hundreds of models, and each call to oneshotLLM reads a ctx.model that could, in principle, be swapped per role.
This is not a small thing. The standard critique of LLM-as-judge architectures — the one I would have made of pi-ralph in a different draft — is that the Generator and Judge share blind spots. If both are the same Claude or the same GPT, the Judge inherits the Generator's confabulations and rubber-stamps them. The literature on cross-model evaluation suggests this is more than theoretical: same-family judges are reliably more lenient on same-family outputs.
The fix is a topology change, and the harness already permits it:
Pi-ralph today — all three roles share one model:
flowchart LR
G1["Generator"] --> M1["Claude Opus"]
C1["Critique"] --> M1
J1["Judge"] --> M1
A multi-model fork — each role bound to a model chosen for its job:
flowchart LR
G2["Generator"] --> M2A["Claude Opus — quality"]
C2["Critique"] --> M2B["GPT-5 — different blindspots"]
J2["Judge"] --> M2C["Haiku — cheap parser"]
A production fork of pi-ralph would let the Generator run on the strongest available model (where quality dominates cost), bind the Critique to a different family entirely (to surface failures the Generator's family is structurally unable to see), and route the Judge to a cheap, fast model that does little more than parse the criticism and return a clean {done, reason} JSON. The eleven-call budget becomes a tunable cost-curve rather than a fixed number, and the same-model collusion problem softens considerably.
Pi-ralph does not ship this. The harness makes it a ten-line change, not a refactor. That distance — between what the code does and what the platform allows — is the more interesting object of study than either the code or the platform on its own.
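One possible shape for per-role binding is a small override table with a fallback to the session model. The sketch below is hypothetical throughout: `Role`, `ModelRef`, `RoleModels`, `modelFor`, and `crossFamily` are illustrative names, not types or functions from pi or pi-ralph, and how a resolved model would actually be handed to the runtime is left out.

```typescript
// Hypothetical per-role model binding. `ModelRef` and the role names are
// illustrative; they are not types from pi or pi-ralph.
type Role = "generator" | "critique" | "judge";

interface ModelRef { provider: string; id: string }

type RoleModels = Partial<Record<Role, ModelRef>>;

// Fall back to the session's selected model when a role has no override,
// which reproduces the current behavior (all three roles share one model).
function modelFor(role: Role, sessionModel: ModelRef, overrides: RoleModels = {}): ModelRef {
  return overrides[role] ?? sessionModel;
}

// True when the Generator and Critique resolve to different providers,
// which is the property the multi-model topology is after.
function crossFamily(sessionModel: ModelRef, overrides: RoleModels): boolean {
  const g = modelFor("generator", sessionModel, overrides);
  const c = modelFor("critique", sessionModel, overrides);
  return g.provider !== c.provider;
}
```

With an empty override table the three roles collapse onto the session model, so a fork built this way would stay backward compatible with the single-model default.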
Latent property two: bash is unabashed, and prompts can write tools
The second affordance is more dangerous and, in its way, more radical.
The Generator in pi-ralph today writes prose. Sometimes that prose looks like code — the README's examples include a single-file SQLite clone in C and a lock-free hashmap in Zig — but nothing compiles it, nothing runs it, nothing tests it. The Judge is grading text-about-code, which is a weaker thing than grading code.
But pi exposes bash. Unrestricted, in-process bash. And the meta-prompt step has already established that one LLM call can produce the system prompt for another. The chain extends naturally:
flowchart TD
U["User submits goal"] --> M["Meta-prompt writes<br/>agent prompts"]
M --> I1["Iteration 1"]
I1 --> G1["Generator writes verifier<br/>and candidate solution"]
G1 --> B1["Bash executes verifier<br/>against candidate"]
B1 --> C1["Critique reads real output<br/>not a hallucination"]
C1 --> J1{"Judge done?"}
J1 -->|"no, fix failing test"| I2["Iteration 2"]
I2 --> G2["Generator patches both<br/>candidate and verifier"]
G2 --> B2["Bash re-executes<br/>exit code 0"]
B2 --> J2{"Judge done?"}
J2 -->|"yes, verified"| Done["Verdict returned to user"]
A loop that started as a text-improvement engine becomes a self-extending one. The Generator can, mid-run, decide it lacks a tool, manifest the tool, and have the next iteration's Critique use it. The feasibility gate that excluded "cook me an omelette" weakens every iteration, because the agent can synthesize the instruments it was missing at the start. By iteration two, the Judge is no longer evaluating prose; it is evaluating exit codes.
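The shift from grading prose to grading exit codes can be sketched as follows. Everything here is an assumption made for illustration: `Exec` is an injected command runner rather than a pi API, and `verify`, `critiqueInput`, and `judgeFromExit` are hypothetical helpers. The prefix allow-list is a placeholder for a real execution policy, not a sandbox.

```typescript
// Hypothetical sketch: ground the Critique and Judge in a real execution result.
// `Exec` is an injected command runner (an assumption, not a pi API), so the
// policy can be read without committing to any particular shell binding.
interface ExecResult { exitCode: number; stdout: string; stderr: string }
type Exec = (command: string) => ExecResult;

interface Candidate { verifierCommand: string }

// Run the verifier only if its command starts with an allow-listed prefix.
// A prefix check is a stand-in for a real policy: it does not stop chained
// commands and must not be mistaken for a sandbox.
function verify(candidate: Candidate, exec: Exec, allowedPrefixes: string[]): ExecResult {
  const allowed = allowedPrefixes.some((p) => candidate.verifierCommand.indexOf(p) === 0);
  if (!allowed) {
    return { exitCode: 126, stdout: "", stderr: `refused: ${candidate.verifierCommand}` };
  }
  return exec(candidate.verifierCommand);
}

// The Critique reads the captured output, not the model's guess at it.
function critiqueInput(goal: string, run: ExecResult): string {
  return [
    `Goal: ${goal}`,
    `exit code: ${run.exitCode}`,
    `stdout:\n${run.stdout}`,
    `stderr:\n${run.stderr}`,
  ].join("\n");
}

// With a verifier in place, the "done" decision can be mechanical.
function judgeFromExit(run: ExecResult): { done: boolean; reason: string } {
  return run.exitCode === 0
    ? { done: true, reason: "verifier exited 0" }
    : { done: false, reason: `verifier exited ${run.exitCode}` };
}
```

Injecting the runner keeps the decision logic separate from whatever actually executes commands, so the execution layer can be swapped for something more restrictive without touching the loop.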
The danger framing here is real and worth keeping in view. Arbitrary bash, plus LLM-authored code, plus automatic execution, is the textbook setup for prompt-injection-becomes-arbitrary-code-execution. The same property that makes the loop self-extending makes it self-compromising; a goal crafted by an adversary, or merely a hallucinated subroutine that does the wrong thing with confidence, becomes an exploit surface that no amount of sandbox-after-the-fact will fully contain. The honest version of this section is not "isn't this exciting?" but "isn't this exactly the configuration the safety literature has been worried about for three years?"
It is, all the same, a real existence proof. The boring 900-line scaffold is one creative prompt away from being a self-modifying system. That fact is more important than any of the technical details above it.
What you still have to build
A balanced view requires an inventory of what is genuinely absent: not the things the harness latently allows, but the things any production user will have to add themselves.
Task decomposition. Pi-ralph treats the goal as a single atomic thing. A real agentic system handling, say, "migrate this Rails monolith to a service-oriented architecture" needs sub-goal planning, dependency tracking, and partial-credit accounting. The triad pattern doesn't decompose; it iterates.
Branching and parallel search. The loop is linear. There is no best-of-N candidate comparison, no tree search, no roll-back from a dead branch. For goals where the search space is wide and the local minima are deep, linear iteration is the wrong shape.
Persistence and memory. Each invocation is amnesiac. Re-running a similar goal re-pays the meta-prompt cost and re-discovers the same domain. A modest cache keyed on goal-similarity would pay for itself in a single repeated workflow.
Convergence-aware termination. The loop exits on judge-says-done or max-iterations. Oscillation between two near-identical results, or a plateau where each round improves by 1%, both currently burn calls. Termination on rate-of-change would save real money in practice.
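A rate-of-change check of this kind might look like the sketch below. It is one assumed approach, not something pi-ralph implements: `similarity` uses word-set Jaccard overlap only because it needs no dependencies, and `convergenceStop` with its 0.95 default threshold is a hypothetical helper whose threshold would need tuning per domain.

```typescript
// Hypothetical plateau/oscillation detector over the Generator's outputs.
// Similarity is word-set Jaccard overlap, chosen only because it needs no
// dependencies; any text-similarity measure could be substituted.
function uniqueWords(text: string): string[] {
  const words = text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i);
}

function similarity(a: string, b: string): number {
  const wa = uniqueWords(a);
  const wb = uniqueWords(b);
  if (wa.length === 0 && wb.length === 0) return 1;
  const shared = wa.filter((w) => wb.indexOf(w) !== -1).length;
  return shared / (wa.length + wb.length - shared);
}

type StopReason = "plateau" | "oscillation" | null;

// Stop when the latest result is nearly identical to the previous one
// (plateau) or to the one before that (an A -> B -> A oscillation).
function convergenceStop(history: string[], threshold = 0.95): StopReason {
  const n = history.length;
  if (n >= 2 && similarity(history[n - 1], history[n - 2]) >= threshold) return "plateau";
  if (n >= 3 && similarity(history[n - 1], history[n - 3]) >= threshold) return "oscillation";
  return null;
}
```

In a loop, each Generator result would be appended to `history` and the iteration abandoned when `convergenceStop` returns a non-null reason, alongside the existing judge-says-done and max-iterations exits.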
Output schema enforcement. The Judge returns JSON-or-fallback. Everything else returns prose. For any downstream system that consumes pi-ralph's output programmatically, the lack of structured guarantees is a cliff.
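The JSON-or-fallback shape can be illustrated with a tolerant parser. This is a sketch under stated assumptions, not pi-ralph's parser: `Verdict` and `parseVerdict` are hypothetical names, and treating an unparseable reply as "not done" is a design choice made here for illustration.

```typescript
// Hypothetical tolerant parser for a Judge reply expected to be {done, reason}.
// This is not pi-ralph's parser; it illustrates the "JSON-or-fallback" shape.
interface Verdict { done: boolean; reason: string }

function parseVerdict(raw: string): Verdict {
  // Models often wrap JSON in prose or a code fence, so take the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
      if (typeof parsed === "object" && parsed !== null) {
        const obj = parsed as { done?: unknown; reason?: unknown };
        if (typeof obj.done === "boolean") {
          return { done: obj.done, reason: typeof obj.reason === "string" ? obj.reason : "" };
        }
      }
    } catch {
      // Not valid JSON: fall through to the fallback below.
    }
  }
  // Fallback: an unparseable reply is treated as "not done", so the loop keeps
  // iterating (up to its cap) instead of declaring success on garbage.
  return { done: false, reason: `unparseable judge reply: ${raw.trim().slice(0, 80)}` };
}
```

Validating the field types, rather than trusting whatever `JSON.parse` returns, is what gives a downstream consumer a guarantee about the verdict; the same pattern would have to be repeated for any other role whose output is consumed programmatically.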
Tool composition with the broader ecosystem. Pi has fifty-plus example extensions covering everything from SSH execution to sandboxing. Pi-ralph composes none of them. A fork that lets the Generator request specific extensions by name, with permission gates and audit logs, would be a different kind of system.
Drawn against the scaffold, the production perimeter looks like this:
flowchart TD
Title["Production perimeter (you build)"]
Title --> Decomp["Task decomposition"]
Decomp --> Branch["Branching / best-of-N"]
Branch --> Mem["Persistence / memory"]
Mem --> Term["Convergence-aware termination"]
Term --> Schema["Output schema enforcement"]
Schema --> Safety["Tool sandboxing + audit"]
Safety --> Core["Pi-ralph scaffold<br/>around 900 lines of code"]
None of these omissions are oversights. They are deliberate absences in a scaffold whose virtue is its size.
The scaffold as feature
The reason pi-ralph rewards close reading is that it is honest about its category. It is not a framework. It is not a product. It is a recipe — small enough to inspect in an afternoon, opinionated enough to demonstrate a pattern, incomplete enough to invite extension. The Generator/Critique/Judge triad is decades old in various forms; the meta-prompt-generates-prompts move is younger but well-trodden; the bounded loop with streaming intermediate outputs is the obvious shape once you have both. There is no novel research artifact here.
What pi-ralph does offer — and what is worth taking seriously — is a precise sketch of what the smallest reasonable scaffold looks like once you have a harness as capable as pi underneath it. The lesson is not that nine hundred lines is enough for production. It almost certainly isn't. The lesson is that nine hundred lines is enough to demonstrate, with full clarity, where production complexity actually lives: not in the orchestration loop, which is shockingly small, but in the verification, the memory, the decomposition, the cost shaping, and the safety boundaries that surround it.
Adopt the pattern. Fork the repo. But know which production concerns you are inheriting and which you are still on the hook to build. The scaffold is small on purpose. The system you build around it will not be.