Run Theater

The live cohort, replayed cell by cell: real tool calls stream, real gates flip, real scores land. When a run is in progress, it tracks the cell that is executing now.

1. The agent works

A cell (runtime · model · effort) picks up a task and streams tool calls — reads, greps, edits — exactly as captured in its transcript.
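As a rough illustration of the idea above, a cell and its transcript replay might be modelled like this. This is a minimal sketch, not the page's actual implementation: the names Cell, ToolCall and replay, the field values, and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Cell:
    """One (runtime, model, effort) combination. Field values are hypothetical."""
    runtime: str
    model: str
    effort: str

    def label(self) -> str:
        return f"{self.runtime} · {self.model} · {self.effort}"


@dataclass(frozen=True)
class ToolCall:
    """A single captured tool call, e.g. a read, grep, or edit."""
    kind: str
    target: str


def replay(transcript: List[ToolCall]) -> Iterator[str]:
    """Yield one display line per tool call, in the order it was captured."""
    for step, call in enumerate(transcript, start=1):
        yield f"{step:02d} {call.kind} {call.target}"


# Hypothetical cell and transcript, for illustration only.
cell = Cell(runtime="runtime-a", model="model-x", effort="high")
transcript = [
    ToolCall("read", "src/App.tsx"),
    ToolCall("grep", "useState"),
    ToolCall("edit", "src/App.tsx"),
]
lines = list(replay(transcript))
```

Because replay is a generator, a viewer could consume it one line at a time to mimic streaming rather than printing the whole transcript at once.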

2. The gates judge

Authoritative gates run against the live app: static lint, eslint, render-proof, vision. A failure triggers a fix loop; convergence counts the attempts it takes for the gates to pass.
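The gate-and-fix loop described above can be sketched as follows. This assumes each gate is a simple pass/fail check and that a fix step is handed the names of the failing gates; the functions run_gates and converge, the max_attempts cap, and the toy gate checks are hypothetical stand-ins, not the real gates.

```python
from typing import Callable, Dict, List, Tuple

Gate = Callable[[str], bool]          # returns True when the app passes
Fix = Callable[[str, List[str]], str]  # returns a revised app


def run_gates(app: str, gates: Dict[str, Gate]) -> List[str]:
    """Return the names of the gates that fail for this app state."""
    return [name for name, gate in gates.items() if not gate(app)]


def converge(app: str, gates: Dict[str, Gate], fix: Fix,
             max_attempts: int = 3) -> Tuple[bool, int]:
    """Run gates, fix on failure, and report (passed, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        failed = run_gates(app, gates)
        if not failed:
            return True, attempt
        if attempt < max_attempts:
            app = fix(app, failed)
    return False, max_attempts


# Toy gates over a source string (assumed checks, for illustration only).
gates: Dict[str, Gate] = {
    "static lint": lambda app: "TODO" not in app,
    "eslint": lambda app: "var " not in app,
}


def fix_first(app: str, failed: List[str]) -> str:
    """Repair only the first failing gate, so each attempt fixes one thing."""
    if failed[0] == "static lint":
        return app.replace(" // TODO", "")
    return app.replace("var ", "const ")


result = converge("var x = 1; // TODO", gates, fix_first)
capped = converge("var x = 1; // TODO", gates, fix_first, max_attempts=2)
```

Here the attempt count is the convergence figure: the toy app needs two fixes, so it passes on the third attempt, and with a cap of two attempts it is reported as not converged.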

3. The score lands

The rubric scores each dimension, normalized to the frontier ceiling, so you see '% of achievable' and which cheaper cell came close.
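One plausible reading of the normalization above is a per-dimension ratio against the frontier cell's score. The sketch below assumes exactly that; the function percent_of_ceiling, the dimension names, and the numbers are hypothetical, and the real rubric may weight or aggregate dimensions differently.

```python
from typing import Dict

Scores = Dict[str, float]


def percent_of_ceiling(scores: Scores, ceiling: Scores) -> Scores:
    """Express each dimension as a percentage of the ceiling score.

    Assumption: the ceiling is the frontier cell's score per dimension.
    A dimension with a zero ceiling is reported as 0.0 to avoid dividing by zero.
    """
    result: Scores = {}
    for dimension, top in ceiling.items():
        if top == 0:
            result[dimension] = 0.0
        else:
            result[dimension] = round(100.0 * scores.get(dimension, 0.0) / top, 1)
    return result


# Hypothetical rubric dimensions and scores, for illustration only.
frontier = {"correctness": 10.0, "design": 8.0}
cheaper = {"correctness": 9.0, "design": 6.0}
normalized = percent_of_ceiling(cheaper, frontier)
```

Normalizing against the ceiling rather than the rubric maximum is what makes a cheaper cell's result comparable: it shows how much of the achievable score that cell captured.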