Benchmarking AI models through competitive code.

DuelLab publishes a public benchmark that evaluates the game-playing programs that models generate. Each generated program is compiled and run head-to-head against other generated programs on a hidden, growing suite of unique games. It is an LLM benchmark for board and strategy games, designed to complement static, multiple-choice, or HumanEval-style coding evaluations. We’re also building toward a lab for inventing games and analyzing them through reproducible simulation.


Benchmark live today. Public game tools in development.

Programs scored by play

Models submit source code, which is compiled into a player and entered into match play. Leaderboards reflect tournament results between those programs, not a static review of the generated text.
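As a rough illustration of scoring by play, here is a minimal sketch of a round-robin between player programs. Everything in it is an assumption made for the example: the single-pile take-away game, the `Player` signature, the forfeit-on-illegal-move rule, and the three toy entrants are hypothetical and do not describe DuelLab's actual games, player API, or harness.

```python
from itertools import permutations
from typing import Callable, Dict

# Hypothetical player API: given the stones remaining, return how many to take (1 to 3).
Player = Callable[[int], int]


def greedy(stones: int) -> int:
    # Always take as many stones as the rules allow.
    return min(3, stones)


def modular(stones: int) -> int:
    # Leave a multiple of 4 when possible; otherwise take one stone.
    return stones % 4 or 1


def broken(stones: int) -> int:
    # Looks like a player, but every move it returns is illegal.
    return 0


def play_match(first: Player, second: Player, stones: int = 21) -> int:
    """Return 0 if `first` wins, 1 if `second` wins. Taking the last stone wins."""
    players = (first, second)
    turn = 0
    while True:
        try:
            take = players[turn](stones)
        except Exception:
            return 1 - turn  # assumed rule: a crash forfeits the match
        if not isinstance(take, int) or not 1 <= take <= min(3, stones):
            return 1 - turn  # assumed rule: an illegal move forfeits the match
        stones -= take
        if stones == 0:
            return turn
        turn = 1 - turn


def round_robin(entrants: Dict[str, Player]) -> Dict[str, int]:
    """Count wins over every ordered pairing, so each entrant moves first once per opponent."""
    wins = {name: 0 for name in entrants}
    for a, b in permutations(entrants, 2):
        winner = play_match(entrants[a], entrants[b])
        wins[a if winner == 0 else b] += 1
    return wins
```

Calling `round_robin({"greedy": greedy, "modular": modular, "broken": broken})` gives `modular` four wins, `greedy` two, and `broken` none. The ordering comes entirely from how the programs behave when run, which is the point of the section above: `broken` has a valid-looking signature and still loses every match.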

Hidden, growing suite

Rankings are measured on a non-public set of unique games that keeps expanding.

Comparable across time

New models and previously generated programs can be assessed on the same games, so results stay comparable over time.

Why this matters

Plausible output is not enough.

A model can generate code that appears correct without producing a program that can actually perform well. DuelLab evaluates generated programs under execution and match play: submissions are built, run, and assessed against other generated programs. The relevant question is not whether the output looks convincing in isolation, but whether it performs in competition.

The benchmark therefore emphasizes empirical behavior under execution rather than surface plausibility in the generated text or success in a narrow test setting.

How it works

Behind the rankings

The properties called out above share one evaluation protocol. Here is how the public benchmark is structured.

Code-generation task

Models receive a game specification and player API boundaries, then submit source code. DuelLab compiles each submission and runs match play between generated programs.

Elo-style standings

Per-game results feed Elo-style ratings; leaderboard scores incorporate uncertainty so the ranking stays conservative. Exact weighting and aggregation are documented on the methodology page.
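To make the idea concrete, here is a small sketch of an Elo update paired with an uncertainty-penalized leaderboard score. The expected-score and update formulas are the standard Elo ones. The rest is assumed for illustration: the `Rating` class, the K-factor of 32, and the `conservative_score` heuristic (mean minus a penalty that shrinks with games played) are hypothetical and are not DuelLab's documented weighting or aggregation.

```python
import math
from dataclasses import dataclass


@dataclass
class Rating:
    mean: float = 1500.0  # conventional Elo starting point (assumed here)
    games: int = 0


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that A scores against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(a: Rating, b: Rating, score_a: float, k: float = 32.0) -> None:
    """Apply one result in place. score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    delta = k * (score_a - expected_score(a.mean, b.mean))
    a.mean += delta
    b.mean -= delta
    a.games += 1
    b.games += 1


def conservative_score(r: Rating, base_uncertainty: float = 350.0, penalty: float = 2.0) -> float:
    """Assumed heuristic: subtract an uncertainty term that shrinks as more games are played."""
    uncertainty = base_uncertainty / math.sqrt(r.games + 1)
    return r.mean - penalty * uncertainty
```

Under this heuristic, two entrants with the same mean rating are ordered by evidence: the one with more games played gets the higher leaderboard score, which is one way a ranking can stay conservative about entrants it has seen little of.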

Public record, private games

Rankings, a visible leaderboard snapshot timestamp in the benchmark header, and methodology are public. The identities and full rule texts of games in the active suite stay undisclosed during evaluation to limit memorization and gaming.

Game design pipeline is separate from the AI benchmark

New titles enter through a pipeline that does not rely solely on AI. Models may help test a game’s scope, but they are not asked to invent a game.

Read the full methodology

What comes next

From benchmark to lab.

Beyond the public rankings, DuelLab is building tools to represent games as configuration rather than one-off engines, then simulate them at scale so rules, replays, and experiments stay reproducible.
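One way to picture games as configuration with reproducible simulation is the sketch below. It is only an illustration under stated assumptions: `TakeAwayConfig`, its fields, and `simulate` are hypothetical names invented here, and the tools DuelLab is building may look quite different.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TakeAwayConfig:
    """Hypothetical rule set expressed as data instead of a one-off engine."""
    start_stones: int  # must be at least 1
    max_take: int      # must be at least 1


def simulate(config: TakeAwayConfig, seed: int) -> Tuple[int, List[int]]:
    """Play one game between two random players; return (winner, replay of moves)."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the game
    stones, turn = config.start_stones, 0
    replay: List[int] = []
    while True:
        take = rng.randint(1, min(config.max_take, stones))
        replay.append(take)
        stones -= take
        if stones == 0:
            return turn, replay  # whoever takes the last stone wins
        turn = 1 - turn
```

Because the rules live in a config object and the randomness in a seeded generator, `simulate(TakeAwayConfig(21, 3), seed=7)` returns the same winner and move list on every run, and changing a rule means changing a field rather than rewriting an engine.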

The longer-term goal is to build a public environment for inventing, testing, and analyzing games, with AI as an optional tool.
