SilverquiLLM — Snowfox Builds

Summary

Coding agents keep getting more elaborate: single agents, planner-and-implementer splits, full multi-agent crews. But which setup is actually better? SilverquiLLM answers that with a task that’s genuinely hard and genuinely new. The agent reads a brand-new Magic: The Gathering card and implements its real rules in a real game engine; the resulting code is then graded against a hidden, audited test suite.

The problem it solves

Agent setups are multiplying faster than our ability to compare them. Most coding benchmarks are either too easy to be interesting or already leaked into training data, so they can’t separate a real improvement from random noise. SilverquiLLM is built for one job: measure the difference between agent setups, end-to-end. Model, harness, skills, prompts — any part of the agent configuration is a dial you can turn, and the benchmark tells you what turning it actually does.

Why Magic: The Gathering

Real-world shape

Implementing a card is messy, stateful, long-horizon work: pages of rules, edge cases that interact, board state that shifts every turn. It’s the shape of real software, compressed into one task.

Turing-complete

Magic is provably Turing-complete — one of the most computationally complex games there is. There’s no ceiling on how hard a card can get, so the benchmark won’t run out of headroom as agents improve.

Resists contamination

Magic ships new sets constantly. Grading on a freshly released set means the answer can’t be memorized from training data — the score measures reasoning, not recall.

How it works

Isolated in Docker

Every agent runs in its own clean container — reproducible and isolated from every other run, so no agent can see another’s work. Same starting line, every time.

Setups as images

Each setup is packaged as its own Docker image, so comparing them is a swap of the image — and the comparison stays fair, down to the byte.
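As a rough illustration of "comparing them is a swap of the image", here is a minimal Python sketch that builds the `docker run` command for one run. The setup names, image tags, `/task` mount point, and the `docker_run_command` helper are all hypothetical, not taken from the SilverquiLLM repo; only the standard `docker run` flags `--rm`, `--name`, and `-v` are real.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    """One agent configuration packaged as a Docker image (hypothetical names)."""
    name: str
    image: str


def docker_run_command(setup: Setup, task_dir: str, run_id: int) -> list[str]:
    """Build the argv for one isolated run of a setup on a task directory."""
    return [
        "docker", "run",
        "--rm",                                  # throw the container away afterwards
        "--name", f"{setup.name}-run{run_id}",  # a fresh, uniquely named container per run
        "-v", f"{task_dir}:/task:ro",           # read-only task input (assumed mount point)
        setup.image,                             # the only part that encodes the setup
    ]


# Two hypothetical setups given the same task and run number.
single = Setup("single-agent", "example/single-agent:latest")
crew = Setup("multi-agent", "example/multi-agent:latest")

cmd_a = docker_run_command(single, "/tmp/task", 1)
cmd_b = docker_run_command(crew, "/tmp/task", 1)
```

In this sketch the two commands differ only in the container name and the image, which is the point: everything else about the run is held constant.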

Hidden, audited scoring

Code is graded against a hidden test suite, then audited. Every setup is scored on three axes: implementing the new cards, keeping the existing ones working, and what it costs to get there.
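The three axes can be pictured as a small record per run. The sketch below is an assumption about shape only: the field names, the use of pass rates, and cost in US dollars are illustrative choices, not the benchmark's actual scoring code, and the example numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Raw counts from one graded run (hypothetical field names)."""
    new_passed: int          # hidden tests passed for the newly implemented cards
    new_total: int
    regression_passed: int   # tests for existing cards that still pass
    regression_total: int
    cost_usd: float          # assumed unit for "what it costs to get there"


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tests passed; an empty suite scores 0.0 rather than dividing by zero."""
    return passed / total if total else 0.0


def score(result: RunResult) -> dict[str, float]:
    """Report the three axes side by side, without collapsing them into one number."""
    return {
        "new_card_pass_rate": pass_rate(result.new_passed, result.new_total),
        "regression_pass_rate": pass_rate(result.regression_passed, result.regression_total),
        "cost_usd": result.cost_usd,
    }


# Made-up numbers, purely to show the shape of the output.
example = score(RunResult(new_passed=18, new_total=20,
                          regression_passed=95, regression_total=100,
                          cost_usd=3.5))
```

Keeping the axes separate is a deliberate choice in this sketch: a setup that implements more cards but breaks existing ones, or costs far more, stays visible as a trade-off instead of being averaged away.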

What the first run showed

The first flight showed that the instrument works, and it delivered a humbling result. A basic multi-agent setup came in more expensive and less accurate than a single agent, and it took many rounds of tuning just to claw back to parity. More machinery isn’t free performance. V2 sharpens the instrumentation underneath, so genuinely more elaborate setups can be built on solid ground.

Read the full write-up: Initial Findings →

Links

GitHub repo
Read: Initial Findings