Plan & progress
Build roadmap — to baseline runs and beyond
For any task in Omni or Tastymaestro, know the model + effort + harness rules + guardrails that produce the best result — and prove it with evidence, not vibes.
Overall 17/27 tasks · 63%
p0
Foundation
done 4/4
- p0.1 Cell bands (model x effort reasonability bands per task tier)
  harness/arena/cell-bands.json: never haiku/spark for deep planning; opus-max only as ceiling on scout.
- p0.2 Rubric (9 dimensions incl. tool-calling, instruction-following, dumb-zone)
  harness/arena/rubric.json: composite normalized to frontier ceiling = % of achievable.
- p0.3 Leaderboard renderer
  harness/arena/render-leaderboard.mjs: ranks cells by % of ceiling, flags best value.
- p0.4 Sandbox + grounds workflow-testing harness
  harness/grounds/*: isolated worktree, cp -al node_modules, gate grading reconciled with golden.
p1
The Gauntlet UI
active 6/7
- p1.1 Next app foundation + Gauntlet theme + shell/nav
  Tastymaestro techstack, unique violet->magenta->ember identity. Live on Tailscale :8095.
- p1.2 Mission & Questions page
  What we are trying to achieve + the questions we are answering, with status.
- p1.3 Plan & Progress board (renders this roadmap)
  Self-hosting: this file drives the page.
- p1.4 Leaderboard page (live, from scores JSON)
  Replaces static leaderboard.html; honest empty state until baseline runs.
- p1.5 Anchored comment system (crux-style, JSON storage)
  Select text -> popover -> comment with full context; readable by Claude in chat. Verified end-to-end.
- p1.6 Animation/branding variants + hero
  3 FX variants (plasma/particles/grid) + switcher + bloom + entrance motion. R3F+bloom variant is a follow-up once a favorite is picked.
- p1.7 Coolify deploy
  Needs Chris go (deploy is gated). Live on Tailscale until then.
p2
Cell runner + scorer
active 7/7
- p2.1 run-cell.mjs: execute a cell on a task in the sandbox -> run-record (output, gates, transcript, cost)
  Real: claude CLI stream-json (transcript + total_cost_usd) + authoritative gates. Isolated via --strict-mcp-config.
- p2.2 score-cell.mjs: run-record -> rubric scores (objective + transcript dims)
  gate_clearing/convergence/tool_calling/fail_safe from real signals; judge dims provisional until p2.3.
- p2.4 run-arena.mjs: orchestrate cells x task -> scores -> leaderboard
  Cohort normalization + leaderboard. First real entry live: sonnet/medium on order-forms.
- p2.5 Trust: detective leakage void-scan + eligibility gate
  Any run that does git-archaeology / reads the real source repo / refs the answer SHA is VOIDED (fail_safe=0) and excluded from ranking. Effort-uncontrolled runs are quarantined. Proven on 4 synthetic cheats; the pre-fix order-forms run now shows VOID·EFFORT_UNCONTROLLED on the leaderboard.
- p2.6 Trust: split quality from cost (separate axes)
  qualityComposite excludes cost; quality % vs ceiling-or-best-in-cohort; voided runs unranked; proxy quality flagged until the judge pass. Schema /2.
- p2.3 LLM-judge pass (Opus-max) for subjective dimensions
  judge-cell.mjs: pointwise-vs-golden, anchored 0–100 bands, identity-stripped, gates-as-facts, injection-hardened, median-of-K=3 with spread flag. Proven on a real Opus call (caught real nits). Reference-guided when a golden exists, else reference-free.
- p2.7 Golden corpus (10 mined) + measurement protocol
  Mined the real merged result for all 10 conversions into golds/ (reference-guided judging). analyze-arena.mjs: point-biserial discrimination, bootstrap CIs, separability %. The benchmark can now say whether it discriminates.
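The "median-of-K=3 with spread flag" aggregation named in p2.3 could look roughly like this. It is a sketch under assumptions, not the code in judge-cell.mjs: the function names, the returned object shape, and the `maxSpread` threshold of 15 points are hypothetical.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Aggregate K independent judge scores (0-100) for one rubric dimension.
// maxSpread is a hypothetical threshold, not a value taken from the harness.
function aggregateJudgeScores(scores, maxSpread = 15) {
  if (scores.length === 0) throw new Error("need at least one judge score");
  const spread = Math.max(...scores) - Math.min(...scores);
  return { score: median(scores), spread, unstable: spread > maxSpread };
}

// Illustrative inputs only, not real judge output.
console.log(aggregateJudgeScores([72, 75, 70])); // { score: 72, spread: 5, unstable: false }
console.log(aggregateJudgeScores([40, 85, 78])); // { score: 78, spread: 45, unstable: true }
```

Taking the median rather than the mean means a single outlier judgment cannot drag the score, while the spread flag still records that the three passes disagreed.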
p3
Baseline runs
planned 0/3
- p3.1 Build-tier baseline: sculptor across in-band cells on one Omni section
  Sculptor is the FIRST agent under test: the on-ramp that proves the runner/scorer loop. Includes haiku-med as the 'can it surprise us' probe.
- p3.2 First leaderboard populated with real scores
- p3.3 Baseline EVERY core agent: pathfinder + sleuth (scout), sculptor (build), inspector (verify), designer (plan), sherlock (debug)
  The end goal: each core agent benchmarked across its reasonability band, per repo (Omni + Tastymaestro). Sculptor first, then roll out to the rest of the core team.
p4
Dumb-zone sweep + full rubric
planned 0/2
- p4.1 Context-stress sweep (16k/64k/128k/256k/512k) -> breakpoint per cell
- p4.2 Degradation curves surfaced in leaderboard
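One way the planned sweep could be reduced to a single breakpoint per cell is sketched below. The roadmap does not define the breakpoint rule, so the one used here (first context size whose quality falls below 90% of the smallest-context score), the `findBreakpoint` name, and the data shape are all assumptions.

```javascript
// Return the first context size at which quality falls below
// tolerance x the score at the smallest context size, or null if it never does.
// Both the rule and the 0.9 default are illustrative assumptions.
function findBreakpoint(sweep, tolerance = 0.9) {
  const ordered = [...sweep].sort((a, b) => a.contextTokens - b.contextTokens);
  if (ordered.length === 0) return null;
  const baseline = ordered[0].quality;
  const hit = ordered.find((point) => point.quality < tolerance * baseline);
  return hit ? hit.contextTokens : null;
}

// Made-up quality numbers for one cell across the sweep sizes listed in p4.1.
const sweep = [
  { contextTokens: 16000, quality: 82 },
  { contextTokens: 64000, quality: 80 },
  { contextTokens: 128000, quality: 77 },
  { contextTokens: 256000, quality: 61 },
  { contextTokens: 512000, quality: 48 },
];
console.log(findBreakpoint(sweep)); // 256000
```

The same per-size points could feed the degradation curves in p4.2, with the breakpoint marking where each curve leaves the tolerance band.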
p5
Harness-improvement loop
planned 0/2
- p5.1 Versioned harness configs + A/B re-runs
- p5.2 Lift report: which rule moved which dimension
p6
Cloud + per-repo tuning
planned 0/2
- p6.1 Cloud execution (investigate remote-run primitive vs Coolify runner)
- p6.2 Per-repo recommended profiles (Omni, Tastymaestro)
Updated 2026-06-24 · driven by ui/data/roadmap.json