GAUNTLET
Plan & progress

Build roadmap — to baseline runs and beyond

For any task in Omni or Tastymaestro, know the model + effort + harness rules + guardrails that produce the best result — and prove it with evidence, not vibes.

Overall 17/27 tasks · 63%
p0

Foundation

done 4/4
  • p0.1 Cell bands (model x effort reasonability bands per task tier)
    harness/arena/cell-bands.json — never haiku/spark for deep planning; opus-max only as ceiling on scout.
  • p0.2 Rubric (9 dimensions incl. tool-calling, instruction-following, dumb-zone)
    harness/arena/rubric.json — composite normalized to frontier ceiling = % of achievable.
  • p0.3 Leaderboard renderer
    harness/arena/render-leaderboard.mjs — ranks cells by % of ceiling, flags best value.
  • p0.4 Sandbox + grounds workflow-testing harness
    harness/grounds/* — isolated worktree, cp -al node_modules, gate grading reconciled with golden.
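
The "composite normalized to frontier ceiling" idea in p0.2 can be sketched roughly as follows. This is a minimal illustration, not the contents of rubric.json or render-leaderboard.mjs: the function names, the weights, and the sample scores are all hypothetical, and it assumes each dimension is scored 0 to 100 and combined as a weighted mean.

```javascript
// Hypothetical sketch: function names, weights, and scores are illustrative
// assumptions, not values taken from harness/arena/rubric.json.

/** Weighted mean of per-dimension scores (0..100). Missing dimensions count as 0. */
function composite(scores, weights) {
  let total = 0;
  let weightSum = 0;
  for (const [dim, w] of Object.entries(weights)) {
    total += (scores[dim] ?? 0) * w;
    weightSum += w;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

/** A cell's composite expressed as a percentage of the frontier ceiling's composite. */
function percentOfCeiling(cellScores, ceilingScores, weights) {
  const ceilingComposite = composite(ceilingScores, weights);
  if (ceilingComposite === 0) return 0;
  return (composite(cellScores, weights) / ceilingComposite) * 100;
}

// Made-up example data.
const weights = { gate_clearing: 2, tool_calling: 1, convergence: 1 };
const ceiling = { gate_clearing: 100, tool_calling: 80, convergence: 80 };
const cell = { gate_clearing: 50, tool_calling: 80, convergence: 80 };

console.log(percentOfCeiling(cell, ceiling, weights).toFixed(1)); // prints "72.2"
```

Dividing by the ceiling's composite rather than by 100 means a cell is judged against what the frontier cell actually achieved on the task, which is what "% of achievable" suggests.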
p1

The Gauntlet UI

active 6/7
  • p1.1 Next app foundation + Gauntlet theme + shell/nav
    Tastymaestro techstack, unique violet->magenta->ember identity. Live on Tailscale :8095.
  • p1.2 Mission & Questions page
    What we are trying to achieve + the questions we are answering, with status.
  • p1.3 Plan & Progress board (renders this roadmap)
    Self-hosting: this file drives the page.
  • p1.4 Leaderboard page (live, from scores JSON)
    Replaces static leaderboard.html; honest empty state until baseline runs.
  • p1.5 Anchored comment system (crux-style, JSON storage)
    Select text -> popover -> comment with full context; readable by Claude in chat. Verified end-to-end.
  • p1.6 Animation/branding variants + hero
    3 FX variants (plasma/particles/grid) + switcher + bloom + entrance motion. R3F+bloom variant is a follow-up once a favorite is picked.
  • p1.7 Coolify deploy
    Needs Chris's go-ahead (deploy is gated). Live on Tailscale until then.
p2

Cell runner + scorer

active 7/7
  • p2.1 run-cell.mjs — execute a cell on a task in the sandbox -> run-record (output, gates, transcript, cost)
    Real: claude CLI stream-json (transcript + total_cost_usd) + authoritative gates. Isolated via --strict-mcp-config.
  • p2.2 score-cell.mjs — run-record -> rubric scores (objective + transcript dims)
    gate_clearing/convergence/tool_calling/fail_safe from real signals; judge dims provisional until p2.3.
  • p2.4 run-arena.mjs — orchestrate cells x task -> scores -> leaderboard
    Cohort normalization + leaderboard. First real entry live: sonnet/medium on order-forms.
  • p2.5 Trust: detective leakage void-scan + eligibility gate
    Any run that does git-archaeology, reads the real source repo, or refs the answer SHA is VOIDED (fail_safe=0) and excluded from ranking. Effort-uncontrolled runs are quarantined. Proven on 4 synthetic cheats; the pre-fix order-forms run now shows VOID·EFFORT_UNCONTROLLED on the leaderboard.
  • p2.6 Trust: split quality from cost (separate axes)
    qualityComposite excludes cost; quality % vs ceiling-or-best-in-cohort; voided runs unranked; proxy quality flagged until the judge pass. Schema /2.
  • p2.3 LLM-judge pass (Opus-max) for subjective dimensions
    judge-cell.mjs: pointwise-vs-golden, anchored 0–100 bands, identity-stripped, gates-as-facts, injection-hardened, median-of-K=3 with spread flag. Proven on a real Opus call (caught real nits). Reference-guided when a golden exists, else reference-free.
  • p2.7 Golden corpus (10 mined) + measurement protocol
    Mined the real merged result for all 10 conversions into golds/ (reference-guided judging). analyze-arena.mjs: point-biserial discrimination, bootstrap CIs, separability %. The benchmark can now say whether it discriminates.
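
The "median-of-K=3 with spread flag" aggregation named in p2.3 might look something like the sketch below. It is an assumption-laden illustration rather than the code in judge-cell.mjs: the function names, the return shape, and the `spreadThreshold` default of 15 points are hypothetical. It takes K judge scores on the 0 to 100 scale, reports their median, and flags the result when the samples disagree by more than the threshold.

```javascript
// Hypothetical sketch: names, return shape, and the default threshold are
// assumptions for illustration, not taken from judge-cell.mjs.

/** Median of a non-empty array of numbers (does not mutate the input). */
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

/**
 * Collapse K judge samples into one score.
 * `spread` is max minus min; `unstable` marks runs where the judge disagreed
 * with itself by more than `spreadThreshold` points.
 */
function aggregateJudgeScores(samples, spreadThreshold = 15) {
  if (samples.length === 0) throw new Error("need at least one judge sample");
  const spread = Math.max(...samples) - Math.min(...samples);
  return { score: median(samples), spread, unstable: spread > spreadThreshold };
}

console.log(aggregateJudgeScores([72, 75, 70])); // { score: 72, spread: 5, unstable: false }
console.log(aggregateJudgeScores([40, 78, 80])); // { score: 78, spread: 40, unstable: true }
```

The median keeps a single outlier sample from dragging the score, while the spread flag keeps that outlier visible instead of silently discarding it.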
p3

Baseline runs

planned 0/3
  • p3.1 Build-tier baseline: sculptor across in-band cells on one Omni section
    Sculptor is the FIRST agent under test — the on-ramp that proves the runner/scorer loop. Includes haiku-med as the 'can it surprise us' probe.
  • p3.2 First leaderboard populated with real scores
  • p3.3 Baseline EVERY core agent — pathfinder + sleuth (scout), sculptor (build), inspector (verify), designer (plan), sherlock (debug)
    The end goal: each core agent benchmarked across its reasonability band, per repo (Omni + Tastymaestro). Sculptor first, then roll out to the rest of the core team.
p4

Dumb-zone sweep + full rubric

planned 0/2
  • p4.1 Context-stress sweep (16k/64k/128k/256k/512k) -> breakpoint per cell
  • p4.2 Degradation curves surfaced in leaderboard
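
One possible way to turn a context-stress sweep into a "breakpoint per cell" is sketched below. Since p4 is still planned, everything here is an assumption: the `findBreakpoint` name, the `{ contextTokens, quality }` record shape, the rule "first context size where quality drops below 80% of the smallest-context score", and the sample numbers, which are invented purely to exercise the function.

```javascript
// Hypothetical sketch: the record shape, the 0.8 retention rule, and the
// sample data are all assumptions; no real sweep results exist yet.

/**
 * Return the smallest context size at which quality falls below
 * `retainFraction` of the score at the smallest context size,
 * or null if the cell never degrades that far.
 */
function findBreakpoint(sweep, retainFraction = 0.8) {
  const points = [...sweep].sort((a, b) => a.contextTokens - b.contextTokens);
  if (points.length === 0) return null;
  const baseline = points[0].quality;
  const hit = points.find((p) => p.quality < baseline * retainFraction);
  return hit ? hit.contextTokens : null;
}

// Invented data for one cell across the planned sweep sizes.
const sweep = [
  { contextTokens: 16000, quality: 90 },
  { contextTokens: 64000, quality: 88 },
  { contextTokens: 128000, quality: 85 },
  { contextTokens: 256000, quality: 66 },
  { contextTokens: 512000, quality: 50 },
];

console.log(findBreakpoint(sweep)); // 256000
```

A relative threshold keeps cells with different baseline quality comparable; an absolute quality floor would be an equally reasonable choice.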
p5

Harness-improvement loop

planned 0/2
  • p5.1 Versioned harness configs + A/B re-runs
  • p5.2 Lift report: which rule moved which dimension
p6

Cloud + per-repo tuning

planned 0/2
  • p6.1 Cloud execution (investigate remote-run primitive vs Coolify runner)
  • p6.2 Per-repo recommended profiles (Omni, Tastymaestro)

Updated 2026-06-24 · driven by ui/data/roadmap.json