Project
SilverquiLLM
A benchmark for coding agents: implement brand-new Magic: The Gathering cards in a real game engine — and find out which agent setups actually hold up.
Summary
Coding agents keep getting more elaborate: single agents, planner-and-implementer splits, full multi-agent crews. But which setup is actually better? SilverquiLLM answers that with a task that’s genuinely hard and genuinely new. The agent reads a brand-new Magic: The Gathering card and implements its real rules in a real game engine, and the resulting code is graded against a hidden, audited test suite.
The problem it solves
Agent setups are multiplying faster than our ability to compare them. Most coding benchmarks are either too easy to be interesting or already leaked into training data, so they can’t separate a real improvement from random noise. SilverquiLLM is built for one job: measure the difference between agent setups, end-to-end. Model, harness, skills, prompts — any part of the agent configuration is a dial you can turn, and the benchmark tells you what turning it actually does.
Why Magic: The Gathering
Real-world shape
Implementing a card is messy, stateful, long-horizon work: pages of rules, edge cases that interact, board state that shifts every turn. It’s the shape of real software, compressed into one task.
Turing-complete
Magic is provably Turing-complete — one of the most computationally complex games there is. There’s no ceiling on how hard a card can get, so the benchmark won’t run out of headroom as agents improve.
Resists contamination
Magic ships new sets constantly. Grading on a set released after a model’s training data was collected means the answer can’t be memorized, so the score measures reasoning, not recall.
How it works
Isolated in Docker
Every agent runs in its own clean container — reproducible and isolated from every other run, so no agent can see another’s work. Same starting line, every time.
Setups as images
Each setup is packaged as its own Docker image, so comparing two setups means swapping one image for another while everything else about the run stays the same.
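As a rough illustration of "a swap of the image", the sketch below builds the `docker run` command for two setups and changes nothing but the image name. The image tags, the `TASK_ID` environment variable, and the task identifier are hypothetical placeholders, not the project's actual interface; only the `docker run`, `--rm`, and `-e` flags are standard Docker CLI.

```python
import shlex

# Hypothetical image tags; the real image names are not given here.
SETUPS = {
    "single-agent": "example/single-agent:latest",
    "multi-agent": "example/multi-agent:latest",
}


def build_run_command(image: str, task_id: str) -> list[str]:
    """Return a `docker run` argument list for one setup on one task."""
    return [
        "docker", "run",
        "--rm",                      # discard the container after the run
        "-e", f"TASK_ID={task_id}",  # assumed way of passing the task in
        image,                       # the only part that differs per setup
    ]


if __name__ == "__main__":
    for name, image in SETUPS.items():
        # Print the command rather than executing it, so the sketch
        # runs without Docker installed.
        print(name, shlex.join(build_run_command(image, "card-001")))
```

Because the image is the last argument and everything before it is identical, any difference in outcome can be attributed to what is inside the image.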
Hidden, audited scoring
Code is graded against a hidden test suite, then audited. Every setup is scored on three axes: implementing the new cards, keeping the existing ones working, and what it costs to get there.
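One way to picture the three axes is as a small record that keeps them separate instead of folding them into a single number. This is a minimal sketch under assumptions: the field names, the pass-rate definition, and the use of US dollars for cost are illustrative choices, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    """Hypothetical per-setup score record with one field group per axis."""
    new_cards_passed: int      # hidden tests passed for newly implemented cards
    new_cards_total: int
    regressions_passed: int    # tests for existing cards that still pass
    regressions_total: int
    cost_usd: float            # assumed cost unit

    @property
    def new_card_rate(self) -> float:
        if self.new_cards_total == 0:
            return 0.0
        return self.new_cards_passed / self.new_cards_total

    @property
    def regression_rate(self) -> float:
        if self.regressions_total == 0:
            return 0.0
        return self.regressions_passed / self.regressions_total


def compare(a: RunScore, b: RunScore) -> dict[str, float]:
    """Return per-axis differences (a minus b), keeping the axes separate."""
    return {
        "new_card_rate": a.new_card_rate - b.new_card_rate,
        "regression_rate": a.regression_rate - b.regression_rate,
        "cost_usd": a.cost_usd - b.cost_usd,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only; these are not benchmark results.
    setup_a = RunScore(8, 10, 95, 100, 4.0)
    setup_b = RunScore(6, 10, 90, 100, 9.0)
    print(compare(setup_a, setup_b))
```

Keeping the axes apart makes trade-offs visible: a setup that implements more new cards but breaks existing ones, or costs several times as much, shows up as such instead of disappearing into one blended score.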
What the first run showed
The first run showed that the instrument works, and it delivered a humbling result. A basic multi-agent setup came in more expensive and less accurate than a single agent, and it took many rounds of tuning just to claw back to parity. More machinery isn’t free performance. V2 sharpens the instrumentation underneath, so genuinely more elaborate setups can be built on solid ground.
Read the full write-up: Initial Findings →