Plan & progress

Build roadmap — to baseline runs and beyond

For any task in Omni or Tastymaestro, know the model + effort + harness rules + guardrails that produce the best result — and prove it with evidence, not vibes.

Overall: 17/27 tasks · 63%

Foundation

done · 4/4 · The spec + measurement scaffolding the whole platform stands on.

p0.1 Cell bands (model x effort reasonability bands per task tier)
harness/arena/cell-bands.json: never haiku/spark for deep planning; opus-max only as ceiling on scout.
p0.2 Rubric (9 dimensions incl. tool-calling, instruction-following, dumb-zone)
harness/arena/rubric.json: composite normalized to frontier ceiling = % of achievable.
p0.3 Leaderboard renderer
harness/arena/render-leaderboard.mjs: ranks cells by % of ceiling, flags best value.
p0.4 Sandbox + grounds workflow-testing harness
harness/grounds/*: isolated worktree, cp -al node_modules, gate grading reconciled with golden.
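
The "% of ceiling" ranking described for the rubric and leaderboard renderer can be sketched roughly as follows. This is a minimal illustration, not the contents of rubric.json or render-leaderboard.mjs: the function names, the record shape (`id`, `scores`, `costUsd`), the weights, and the definition of "best value" as % of ceiling per dollar are all assumptions made for the example.

```javascript
// Weighted mean of per-dimension scores (each assumed to lie in 0..1).
// `weights` maps dimension name -> weight; missing scores count as 0.
function composite(scores, weights) {
  let total = 0;
  let weightSum = 0;
  for (const [dim, w] of Object.entries(weights)) {
    total += w * (scores[dim] ?? 0);
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

// Express every cell's composite as a percentage of the ceiling cell's
// composite, sort descending, and flag one cell as best value.
function rankCells(cells, weights, ceilingId) {
  const ceiling = cells.find((c) => c.id === ceilingId);
  if (!ceiling) throw new Error(`unknown ceiling cell: ${ceilingId}`);
  const ceilingScore = composite(ceiling.scores, weights);
  if (ceilingScore <= 0) throw new Error('ceiling composite must be positive');

  const ranked = cells
    .map((c) => {
      const pctOfCeiling = (100 * composite(c.scores, weights)) / ceilingScore;
      // Assumed definition of value: % of ceiling bought per dollar spent.
      const valuePerUsd = c.costUsd > 0 ? pctOfCeiling / c.costUsd : 0;
      return { id: c.id, pctOfCeiling, costUsd: c.costUsd, valuePerUsd };
    })
    .sort((a, b) => b.pctOfCeiling - a.pctOfCeiling);

  const best = ranked.reduce((a, b) => (b.valuePerUsd > a.valuePerUsd ? b : a));
  return ranked.map((r) => ({ ...r, bestValue: r === best }));
}

// Hypothetical input: two cells scored on two rubric dimensions.
const exampleWeights = { gate_clearing: 2, tool_calling: 1 };
const exampleCells = [
  { id: 'sonnet/medium', scores: { gate_clearing: 0.9, tool_calling: 0.6 }, costUsd: 0.5 },
  { id: 'opus/max', scores: { gate_clearing: 1.0, tool_calling: 0.8 }, costUsd: 2.0 },
];
const exampleRanking = rankCells(exampleCells, exampleWeights, 'opus/max');
```

In this example the ceiling cell ranks first at 100%, while the cheaper cell lands near 86% of ceiling and carries the best-value flag because it delivers more of the ceiling per dollar.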

The Gauntlet UI

active · 6/7 · A branded, live, interactive home Chris can see and steer from: mission, plan, leaderboard, comments.

p1.1 Next app foundation + Gauntlet theme + shell/nav
Tastymaestro techstack, unique violet->magenta->ember identity. Live on Tailscale :8095.
p1.2 Mission & Questions page
What we are trying to achieve + the questions we are answering, with status.
p1.3 Plan & Progress board (renders this roadmap)
Self-hosting: this file drives the page.
p1.4 Leaderboard page (live, from scores JSON)
Replaces static leaderboard.html; honest empty state until baseline runs.
p1.5 Anchored comment system (crux-style, JSON storage)
Select text -> popover -> comment with full context; readable by Claude in chat. Verified end-to-end.
p1.6 Animation/branding variants + hero
3 FX variants (plasma/particles/grid) + switcher + bloom + entrance motion. R3F+bloom variant is a follow-up once a favorite is picked.
p1.7 Coolify deploy
Needs Chris's go-ahead (deploy is gated). Live on Tailscale until then.

Cell runner + scorer

active · 7/7 · Run one agent at one (runtime, model, effort) cell on a task; capture a run-record; score it against the rubric.

p2.1 run-cell.mjs: execute a cell on a task in the sandbox -> run-record (output, gates, transcript, cost)
Real: claude CLI stream-json (transcript + total_cost_usd) + authoritative gates. Isolated via --strict-mcp-config.
p2.2 score-cell.mjs: run-record -> rubric scores (objective + transcript dims)
gate_clearing/convergence/tool_calling/fail_safe from real signals; judge dims provisional until p2.3.
p2.4 run-arena.mjs: orchestrate cells x task -> scores -> leaderboard
Cohort normalization + leaderboard. First real entry live: sonnet/medium on order-forms.
p2.5 Trust: detective leakage void-scan + eligibility gate
Any run that does git-archaeology, reads the real source repo, or references the answer SHA is VOIDED (fail_safe=0) and excluded from ranking. Effort-uncontrolled runs are quarantined. Proven on 4 synthetic cheats; the pre-fix order-forms run now shows VOID·EFFORT_UNCONTROLLED on the leaderboard.
p2.6 Trust: split quality from cost (separate axes)
qualityComposite excludes cost; quality % vs ceiling-or-best-in-cohort; voided runs unranked; proxy quality flagged until the judge pass. Schema /2.
p2.3 LLM-judge pass (Opus-max) for subjective dimensions
judge-cell.mjs: pointwise-vs-golden, anchored 0–100 bands, identity-stripped, gates-as-facts, injection-hardened, median-of-K=3 with spread flag. Proven on a real Opus call (caught real nits). Reference-guided when a golden exists, else reference-free.
p2.7 Golden corpus (10 mined) + measurement protocol
Mined the real merged result for all 10 conversions into golds/ (reference-guided judging). analyze-arena.mjs: point-biserial discrimination, bootstrap CIs, separability %. The benchmark can now say whether it discriminates.
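
Two of the statistics named in the measurement protocol can be sketched as below. This is an illustrative reading, not the code in analyze-arena.mjs: it assumes point-biserial discrimination means the correlation between a binary outcome (for example, a gate passing) and a continuous score, and that the bootstrap CI is a percentile interval on the mean. All function names, parameters, and the sample data are hypothetical. A small seeded generator (mulberry32) is used so that resampling is reproducible.

```javascript
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Point-biserial correlation: r_pb = (M1 - M0) / s * sqrt(p * (1 - p)),
// where M1/M0 are mean scores for binary=1/0, s is the population SD of
// all scores, and p is the share of 1s. Returns NaN when undefined.
function pointBiserial(binary, scores) {
  const ones = scores.filter((_, i) => binary[i] === 1);
  const zeros = scores.filter((_, i) => binary[i] === 0);
  if (ones.length === 0 || zeros.length === 0) return NaN;
  const m = mean(scores);
  const sd = Math.sqrt(mean(scores.map((x) => (x - m) ** 2)));
  if (sd === 0) return NaN;
  const p = ones.length / scores.length;
  return ((mean(ones) - mean(zeros)) / sd) * Math.sqrt(p * (1 - p));
}

// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap confidence interval for the mean of `samples`.
function bootstrapMeanCI(samples, { iterations = 2000, alpha = 0.05, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  const n = samples.length;
  const means = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += samples[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * iterations)];
  const hi = means[Math.min(iterations - 1, Math.floor((1 - alpha / 2) * iterations))];
  return { mean: mean(samples), lo, hi };
}

// Hypothetical data: did the gate pass, and what quality score did the run get?
const gatePassed = [1, 1, 0, 0];
const qualityScores = [4, 3, 2, 1];
const discrimination = pointBiserial(gatePassed, qualityScores);
const interval = bootstrapMeanCI(qualityScores, { iterations: 1000, seed: 42 });
```

A discrimination value near 1 would suggest the binary signal separates strong runs from weak ones; a wide bootstrap interval would suggest the cohort is too small to rank cells with confidence.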

Baseline runs

planned · 0/3 · THE milestone: real comparisons populate the leaderboard with evidence, starting with sculptor and ending with the whole core team.

p3.1 Build-tier baseline: sculptor across in-band cells on one Omni section
Sculptor is the FIRST agent under test: the on-ramp that proves the runner/scorer loop. Includes haiku-med as the 'can it surprise us' probe.
p3.2 First leaderboard populated with real scores
p3.3 Baseline EVERY core agent: pathfinder + sleuth (scout), sculptor (build), inspector (verify), designer (plan), sherlock (debug)
The end goal: each core agent benchmarked across its reasonability band, per repo (Omni + Tastymaestro). Sculptor first, then roll out to the rest of the core team.

Dumb-zone sweep + full rubric

planned · 0/2 · Map where each cell breaks down as context grows; complete subjective scoring.

p4.1 Context-stress sweep (16k/64k/128k/256k/512k) -> breakpoint per cell
p4.2 Degradation curves surfaced in leaderboard
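
One way to turn a context-stress sweep into a breakpoint per cell is sketched below. The roadmap does not define "breakpoint", so this assumes it is the first context size at which a cell's score falls below a fixed fraction of its score at the smallest size. The function name, the `retain` threshold, and the sample scores are hypothetical.

```javascript
// `points` is an ordered sweep, smallest context first:
// [{ label: '16k', score: 0.9 }, ...]. Returns the label of the first
// point whose score drops below `retain` times the baseline score, or
// null if the cell never degrades that far within the sweep.
function findBreakpoint(points, retain = 0.8) {
  if (points.length === 0) return null;
  const baseline = points[0].score;
  for (const point of points.slice(1)) {
    if (point.score < retain * baseline) return point.label;
  }
  return null;
}

// Hypothetical sweep results for two cells over the planned sizes.
const degradingCell = [
  { label: '16k', score: 0.9 },
  { label: '64k', score: 0.88 },
  { label: '128k', score: 0.8 },
  { label: '256k', score: 0.6 },
  { label: '512k', score: 0.3 },
];
const stableCell = [
  { label: '16k', score: 0.9 },
  { label: '64k', score: 0.9 },
  { label: '128k', score: 0.89 },
  { label: '256k', score: 0.87 },
  { label: '512k', score: 0.85 },
];

const degradingBreakpoint = findBreakpoint(degradingCell);
const stableBreakpoint = findBreakpoint(stableCell);
```

With an 80% retention threshold the first cell breaks at 256k and the second shows no breakpoint in range. A relative threshold keeps cells with different baselines comparable, though an absolute floor would be an equally defensible choice.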

Harness-improvement loop

planned · 0/2 · Baseline -> change prompts/gates/guardrails -> re-run -> measure the lift toward the ceiling.

p5.1 Versioned harness configs + A/B re-runs
p5.2 Lift report: which rule moved which dimension

Cloud + per-repo tuning

planned · 0/2 · Runs execute in the cloud by default (serial/parallel/scheduled); recommendations fine-tuned per repo.

p6.1 Cloud execution (investigate remote-run primitive vs Coolify runner)
p6.2 Per-repo recommended profiles (Omni, Tastymaestro)

Updated 2026-06-24 · driven by ui/data/roadmap.json