DuelLab blog

Notes from the DuelLab project

Short-form technical write-ups on what we are building, what we are measuring, and what we are still unsure about. Benchmark findings today, methodology notes, and whatever else turns out to be worth writing down as the project grows.

2026-05-09 · Model finding

GPT-5.5 wins the middle

GPT-5.5 is not the overall leader in the current DuelLab results, but it ranks #1 at the medium setting. The interesting part is the curve: medium beats both none and highest.

2026-04-22 · Model finding

Kimi K2.6 tops mixed play

DuelLab is a benchmark where AI models write game-playing programs. In the latest public results, Kimi K2.6 is only #6 overall but jumps to #1 in mixed play, powered by an unusually strong highest-effort mode.

2026-04-18 · Model finding

Claude Opus 4.7 is the first Claude with a V-shaped effort curve

On the overall number, Claude Opus 4.7 looks like a small step over 4.6. Inside the per-track data, the shape changed: 4.7 regressed at the medium effort tier and moved up at both ends. GPT-5.4 and Gemini 3.1 Pro Preview already had this shape. Claude Opus 4.6 did not.

2026-04-18 · Meta

Introducing the DuelLab blog

What DuelLab is today, where it is heading, and why we are opening a blog now.
