GAUNTLET

Run Theater

The live cohort, replayed cell by cell — real tool calls stream, real gates flip, real scores land. When a run is in progress it tracks the cell executing now.

Run Theater
Agent
tool callsattempt 1
loading…
Gates
static lint
eslint
render-proof
mount
Score
% of best
waiting for run data…
1The agent works

A cell (runtime · model · effort) picks up a task and streams tool calls — reads, greps, edits — exactly as captured in its transcript.

2The gates judge

Authoritative gates run against the live app: static lint, eslint, render-proof, vision. A failure triggers a fix loop; convergence counts the attempts.

3The score lands

The rubric scores each dimension, normalized to the frontier ceiling — so you see '% of achievable', and which cheaper cell got close.