Notes from the DuelLab project

Short-form technical write-ups on what we are building, what we are measuring, and what we are still unsure about. Benchmark findings for now, with methodology notes and whatever else turns out to be worth writing down as the project grows.

  • Kimi K2.6 tops mixed play

    DuelLab is a benchmark where AI models write game-playing programs. In the latest public results, Kimi K2.6 is only #6 overall but jumps to #1 in mixed play, powered by an unusually strong highest-effort mode.

  • Claude Opus 4.7 is the first Claude with a V-shaped effort curve

    On the overall number, Claude Opus 4.7 looks like a small step over 4.6. Inside the per-track data, the shape changed: 4.7 regressed at the medium effort tier and moved up at both ends. GPT-5.4 and Gemini 3.1 Pro Preview already had this shape. Claude Opus 4.6 did not.

  • Introducing the DuelLab blog

    What DuelLab is today, where it is heading, and why we are opening a blog now.