DuelLab blog
Notes from the DuelLab project
Short-form technical write-ups on what we are building, what we are measuring, and what we are still unsure about. For now that means benchmark findings; methodology notes and whatever else turns out to be worth writing down will follow as the project grows.
-
Kimi K2.6 tops mixed play
DuelLab is a benchmark where AI models write game-playing programs. In the latest public results, Kimi K2.6 is only #6 overall but jumps to #1 in mixed play, powered by an unusually strong highest-effort mode.
-
Claude Opus 4.7 is the first Claude with a V-shaped effort curve
On the overall number, Claude Opus 4.7 looks like a small step over 4.6. Inside the per-track data, the shape changed: 4.7 regressed at the medium-effort tier and moved up at both ends. GPT-5.4 and Gemini 3.1 Pro Preview already had this shape. Claude Opus 4.6 did not.
-
Introducing the DuelLab blog
What DuelLab is today, where it is heading, and why we are opening a blog now.