Run Theater
The live cohort, replayed cell by cell — real tool calls stream, real gates flip, real scores land. When a run is in progress it tracks the cell executing now.
Panels: Agent (tool calls, attempt 1) · Gates (static lint, eslint, render-proof, mount) · Score (% of best)
1. The agent works
A cell (runtime · model · effort) picks up a task and streams tool calls — reads, greps, edits — exactly as captured in its transcript.
2. The gates judge
Authoritative gates run against the live app: static lint, eslint, render-proof, vision. A failed gate triggers a fix loop, and convergence is measured by the number of attempts needed to pass.
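A minimal sketch of how a gate-and-fix loop like this could work. The gate names come from the text above; everything else (`run_fix_loop`, `GateResult`, the `fix` callback, the `max_attempts` cap, and the string standing in for the app) is hypothetical and only illustrates the idea of counting attempts until every gate passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Gate names as listed in the text; the order is an assumption.
GATE_ORDER = ["static lint", "eslint", "render-proof", "vision"]


@dataclass
class GateResult:
    converged: bool            # True if every gate passed within the cap
    attempts: int              # number of gate runs performed
    failures: List[List[str]]  # failing gate names, one list per failed attempt


def run_fix_loop(
    gates: Dict[str, Callable[[str], bool]],
    app: str,
    fix: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> GateResult:
    """Run all gates; on failure, ask `fix` for a new app state and retry."""
    failures: List[List[str]] = []
    for attempt in range(1, max_attempts + 1):
        failed = [n for n in GATE_ORDER if n in gates and not gates[n](app)]
        if not failed:
            return GateResult(True, attempt, failures)
        failures.append(failed)
        if attempt < max_attempts:
            app = fix(app, failed)  # hypothetical repair step by the agent
    return GateResult(False, max_attempts, failures)


# Toy stand-ins: the "app" is a string, and lint passes once a marker is present.
toy_gates = {
    "static lint": lambda app: "lint-ok" in app,
    "eslint": lambda app: True,
}
toy_fix = lambda app, failed: app + " lint-ok"

result = run_fix_loop(toy_gates, "draft", toy_fix)
print(result.converged, result.attempts)
```

In this toy run the first attempt fails static lint, the fix is applied, and the second attempt passes, so the loop reports convergence in two attempts.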
3. The score lands
The rubric scores each dimension, and the result is normalized to the frontier ceiling, so you see '% of achievable' and which cheaper cell got close.
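The normalization can be sketched as follows, assuming the "frontier ceiling" is the highest total rubric score among the cells in the cohort and that a cell's total is the plain sum of its dimension scores. Both are assumptions, and the function name, cell labels, dimension names, and numbers below are made up for illustration.

```python
from typing import Dict


def percent_of_best(cells: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Map each cell to its total rubric score as a percentage of the best cell."""
    # Assumption: a cell's total is the unweighted sum of its dimension scores.
    totals = {name: sum(dims.values()) for name, dims in cells.items()}
    ceiling = max(totals.values())  # assumed meaning of "frontier ceiling"
    if ceiling <= 0:
        return {name: 0.0 for name in totals}
    return {name: round(100 * t / ceiling, 1) for name, t in totals.items()}


# Hypothetical cohort: two cells scored on two rubric dimensions.
cohort = {
    "frontier-cell": {"correctness": 6.0, "design": 4.0},
    "cheaper-cell": {"correctness": 5.0, "design": 4.0},
}
percentages = percent_of_best(cohort)
print(percentages)
```

Here the frontier cell sets the ceiling at 10 points, so it reads as 100% and the cheaper cell, with 9 points, reads as 90% of achievable.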