SilverquiLLM-bench: Initial Findings

First results from SilverquiLLM-bench — and what they mean, in plain English.

First flight 🛫

Every new measuring instrument needs a first flight: a run that proves the thing actually gets off the ground. This was ours.

SilverquiLLM-bench asks a simple question with a hard answer: can an AI coding agent read a Magic: The Gathering card and correctly implement its rules in a real game engine? We hand an agent ten freshly released cards, let it write the code, then grade that code against a hidden suite of 83 tests.

The goal this round wasn’t to crown a winner. It was to find out whether the benchmark itself works — whether it can produce stable, trustworthy numbers we can build on. The answer: it can, with a few caveats worth being honest about.

More machinery isn’t free performance

We compared two ways of pointing the same AI model at the same cards:

The simple setup: one agent, writing the code start to finish.
The elaborate setup: a “planner” agent that writes a plan, then a separate “implementer” agent that writes tests first and code second (a disciplined, by-the-book workflow).

Intuitively, the elaborate setup should win: more structure, more checking, more steps. It didn’t. The two finished in a statistical tie on accuracy (about 59 vs 58 tests out of 83) and cost the same to run (around $18.50 each), even though the elaborate setup did more work and burned more tokens getting there.

	Simple agent	Elaborate setup
Accuracy — average	59.3 / 83	58.3 / 83
Accuracy — median	60 / 83	61 / 83
Cost per run — average	~$18.6	~$18.4

Same model, same ten cards. The averages land within a single test of each other, and the run-to-run spread (52–65 vs 53–61) dwarfs the gap — the textbook shape of “no real difference.”
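The “no real difference” reading can be checked with back-of-envelope arithmetic. The sketch below uses only the averages and ranges reported above. Its “gap smaller than the run-to-run range” rule is a rough heuristic assumed here for illustration; it is not a formal significance test, and not necessarily the analysis behind this report.

```python
# Figures copied from the table and paragraph above (tests passed out of 83).
simple = {"mean": 59.3, "low": 52, "high": 65}
elaborate = {"mean": 58.3, "low": 53, "high": 61}


def spread(setup):
    """Run-to-run range: best run minus worst run."""
    return setup["high"] - setup["low"]


# Gap between the two averages, rounded to one decimal to hide float noise.
gap = round(abs(simple["mean"] - elaborate["mean"]), 1)
smallest_spread = min(spread(simple), spread(elaborate))

# Rough heuristic (an assumption, not a formal test): if the gap between
# averages is smaller than the run-to-run range of either setup,
# treat the comparison as unresolved.
unresolved = gap < smallest_spread

print(f"gap between averages: {gap} tests")
print(f"run-to-run range: simple {spread(simple)}, elaborate {spread(elaborate)}")
print("verdict:", "no resolvable difference" if unresolved else "gap exceeds noise")
```

With a 1-test gap against ranges of 13 and 8 tests, the heuristic lands on “no resolvable difference,” matching the conclusion above.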

Why didn’t the extra machinery pay off? Because structure only helps if it actually catches mistakes — and this structure didn’t. The “tests first” implementer wrote its own tests based on the same misunderstanding baked into its code, so the tests passed even when the code was wrong. The errors weren’t caught; they were rubber-stamped.
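To make the rubber-stamping concrete, here is a toy sketch. Everything in it is hypothetical: the card text, the function names, and the dictionary-based game model are invented for illustration and are not taken from the benchmark, its cards, or its engine. It shows how a test written from the same misreading as the code passes, while a test written independently from the card text fails.

```python
# Toy illustration only: the card text, function names, and game model are
# hypothetical and are not taken from the benchmark or its engine.
# Imagined card text: "Deal 3 damage to each opponent."


def resolve_card(life_totals, caster):
    """Implementer's code, built on a misreading: it damages EVERY player."""
    return {player: life - 3 for player, life in life_totals.items()}


def self_written_test():
    """Written by the same agent, so it encodes the same misreading."""
    after = resolve_card({"alice": 20, "bob": 20}, caster="alice")
    return after == {"alice": 17, "bob": 17}


def independent_test():
    """Written from the card text alone, without seeing the code."""
    after = resolve_card({"alice": 20, "bob": 20}, caster="alice")
    return after == {"alice": 20, "bob": 17}


print("self-written test passes:", self_written_test())
print("independent test passes:", independent_test())
```

The self-written test reports success because it checks the code against the agent's belief rather than against the card. Only the independent check exposes the error, which is the role a hidden test suite plays.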

That’s the real takeaway, and it generalizes to anyone building multi-agent systems: handing work between agents doesn’t automatically make them smarter. If the handoff doesn’t carry the right context — and doesn’t include an independent way to catch errors — a fancier pipeline just spends more to land in the same place, and can let a single mistake snowball. Getting subagents to genuinely help is an orchestration problem to solve, not a switch to flip.

What the first flight told us about the benchmark

The good news: the instrument is real. Half the cards scored identically on every run — solid, repeatable signal.

The caveat: parts of it are still too noisy or too blunt to settle a close race.

One card swings wildly. A single card sometimes scores high and sometimes near-zero. The cause is mundane but brutal: when the agent guesses the shape of one engine function slightly wrong, nearly every test on that card errors out at once. One small slip, one giant score swing.
Some tests nobody can pass. Eight of the 83 tests were failed by every agent, every time. That is not because the AI isn’t capable, but because the engine itself can’t yet express what those tests need. They’re a ceiling, not a measurement.

The swing card, across 7 runs	Score
Got one engine function right (5 runs)	8–10 of 13
Got it slightly wrong (2 runs)	2 of 13

One wrong guess about a single function’s shape drops the card from ~9/13 to 2/13 — and it hit both setups equally. That’s exactly the noise V2 is built to damp.
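The mechanism behind that swing can be sketched with a toy harness. The function names and the test count below are hypothetical, not the benchmark's. The sketch assumes an engine that always calls a card's handler with a fixed set of arguments, and it routes every toy test through that one call.

```python
# Toy harness: names and test counts are hypothetical, not the benchmark's.


def engine_call(handler, source, target):
    """Stand-in for an engine that always calls a handler with (source, target)."""
    return handler(source, target)


def handler_right_shape(source, target):
    return f"{source} hits {target}"


def handler_wrong_shape(source, target, amount):
    # Same logic, but the engine never supplies `amount`.
    return f"{source} hits {target}"


def run_suite(handler, n_tests=5):
    """Run n_tests checks that all route through engine_call; count passes."""
    passed = 0
    for i in range(n_tests):
        try:
            if engine_call(handler, "card", f"target{i}") == f"card hits target{i}":
                passed += 1
        except TypeError:
            # A signature mismatch raises before any card logic runs,
            # so the test errors out regardless of what it was checking.
            pass
    return passed


print(run_suite(handler_right_shape))  # 5
print(run_suite(handler_wrong_shape))  # 0
```

The two handlers contain identical card logic. The only difference is one extra parameter, and that alone takes the toy score from full marks to zero.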

So the benchmark can confidently tell a big difference from no difference, but it can’t yet resolve a close one: one brittle card adds noise, and eight unpassable tests blunt the scale.

What’s next in V2

The next step isn’t a flashier agent — it’s a better instrument. We’re building V2 of the game engine, and pairing it with one key change: exposing more of the test API to the agent, so a capable agent can see what it’s expected to support — and what it must not break. Together, those two moves close the gaps this first flight exposed:

Harden the brittle spots, so one tiny mistake can’t wipe out an entire card’s score.
Lift the ceiling, by fixing (or retiring) the tests the engine simply can’t support yet.

The payoff: with a sturdier, more expressive engine underneath, future results get sharper, precise enough to actually tell competing approaches apart. That’s when comparisons like “simple vs. elaborate agent” start producing answers you can trust.

Appendix: All configurations tested

The full leaderboard across every harness/model image we evaluated, sorted by audited pass rate. Stripped to the columns that matter most; the per-card, FDN, and engine breakdowns live in run_summary.json. The two setups compared in this report are the Opus 4.8 xhigh-effort rows labeled “No Subagents” (the simple agent) and “plan→TDD v2” (the elaborate setup).

Configuration	Runs	Audited pass rate	~ tests / 83	Avg cost
Claude Code, Opus 4.8, Max Effort (No Subagents)	2	0.717	59.5	$22.74
Claude Code, Opus 4.8, xhigh Effort (No Subagents)	4	0.714	59.2	$18.63
Claude Code, Opus 4.8, xhigh Effort (plan→TDD v2)	3	0.703	58.3	$18.38
Claude Code, Opus 4.8, xhigh Effort (plan→TDD)	3	0.687	57.0	$21.15
Claude Code, Opus 4.8, xhigh Effort (coordinator+tester+implementer+reviewer)	2	0.669	55.5	$48.85
Claude Code, Opus 4.8, high Effort (coordinator+tester+implementer+reviewer)	2	0.669	55.5	$30.51
Claude Code, Opus 4.6, high Effort (coordinator+tester+implementer+reviewer)	2	0.639	53.0	$36.90
Claude Code, Sonnet 4.6, high Effort (coordinator+tester+implementer+reviewer)	2	0.608	50.5	$26.28

Sample sizes are small and uneven (n=2–4), so this ordering is indicative, not a definitive ranking. That holds especially among the closely bunched top configs, where the per-run noise documented above easily exceeds the gaps.