GAUNTLET
Mission

What we're trying to achieve

North star

For any task in Omni or Tastymaestro, know which combination of model, effort, harness rules, and guardrails produces the best result, and prove it with evidence.

The premise

We are not benchmarking raw model capability (public benchmarks do that). We are benchmarking model x harness on OUR repos and OUR tasks — because the harness (prompts, gates, guardrails, context management) is the part we control and can improve.

The questions we're answering

01

What model x effort gives the best result for each task tier?

exploring

Scout, build, verify, plan, debug are different jobs. The right tool for recon is wrong for architecture. We want the sensible default per tier, not one model for everything.

02

Can Sonnet — or even Haiku — match Opus with the right harness? Where exactly is the boundary?

open

If a cheaper model hits 95% of frontier quality at 20% of the cost on our tasks, that changes how we spend every day. We deliberately probe a step below the obvious choice to find where it breaks.
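The "95% of quality at 20% of cost" framing reduces to two ratios against the ceiling cell. A minimal sketch, assuming each cell has a single raw quality score and a total cost for the same task suite; the cell names, the `CellResult` type, and all numbers below are hypothetical and are not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellResult:
    cell: str        # hypothetical label, e.g. "sonnet-med"
    quality: float   # raw score on a task suite (illustrative units)
    cost_usd: float  # total spend for the same suite


def relative_to_ceiling(result: CellResult, ceiling: CellResult) -> tuple[float, float]:
    """Return (quality fraction, cost fraction) relative to the ceiling cell."""
    if ceiling.quality <= 0 or ceiling.cost_usd <= 0:
        raise ValueError("ceiling must have positive quality and cost")
    return result.quality / ceiling.quality, result.cost_usd / ceiling.cost_usd


# Made-up numbers chosen only to reproduce the 95% / 20% example.
ceiling = CellResult("opus-max", quality=80.0, cost_usd=50.0)
candidate = CellResult("sonnet-med", quality=76.0, cost_usd=10.0)

q_frac, c_frac = relative_to_ceiling(candidate, ceiling)
print(f"{candidate.cell}: {q_frac:.0%} of ceiling quality at {c_frac:.0%} of cost")
```

Probing a step below the obvious choice then amounts to running the same comparison for the next cheaper cell and watching where the quality fraction falls off.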

03

Where is the 'dumb zone' for each cell (a model x effort combination): the context size at which it stops being reliable?

open

A model that is sharp at 30k tokens may quietly degrade at 200k. Knowing the breakpoint tells us when to chunk, summarize, or switch to a longer-context cell. Context robustness is a first-class score, not an afterthought.
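One way to locate the breakpoint, sketched under assumptions: the same tasks are scored at several context sizes, and the dumb zone starts at the first size where the score drops more than a chosen tolerance below the smallest-context score. The function name `find_dumb_zone`, the 10% tolerance, and the measurements are hypothetical.

```python
from typing import Optional


def find_dumb_zone(scores_by_context: dict[int, float], tolerance: float = 0.10) -> Optional[int]:
    """Return the smallest context size (in tokens) whose score is more than
    `tolerance` (relative) below the score at the smallest measured size,
    or None if no measured size degrades that far."""
    if not scores_by_context:
        raise ValueError("need at least one measurement")
    sizes = sorted(scores_by_context)
    baseline = scores_by_context[sizes[0]]
    floor = baseline * (1.0 - tolerance)
    for size in sizes[1:]:
        if scores_by_context[size] < floor:
            return size
    return None


# Illustrative scores only: context size in tokens -> quality score in [0, 1].
measurements = {30_000: 0.90, 60_000: 0.89, 120_000: 0.84, 200_000: 0.70}
breakpoint_tokens = find_dumb_zone(measurements)
print("dumb zone starts at:", breakpoint_tokens)
```

The result is only as fine-grained as the context sizes sampled; a real run would likely add sizes between the last reliable point and the first degraded one.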

04

Which harness rules and guardrails actually move scores toward the ceiling?

open

This is the whole point. We baseline, change one rule, re-run, and measure the lift. Rule changes that don't move a scored dimension get cut.
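The baseline, change, re-run, measure loop can be sketched as a per-dimension difference plus a keep-or-cut check. This is an illustration under assumptions: the dimension names, the 0.02 minimum lift (a stand-in for a run-to-run noise threshold), and the scores are all hypothetical. How to treat a rule that lifts one dimension while regressing another is left open here.

```python
def dimension_lift(baseline: dict[str, float], variant: dict[str, float]) -> dict[str, float]:
    """Per-dimension score change (variant minus baseline)."""
    mismatch = baseline.keys() ^ variant.keys()
    if mismatch:
        raise ValueError(f"dimension mismatch: {sorted(mismatch)}")
    return {dim: variant[dim] - baseline[dim] for dim in baseline}


def keep_rule(lift: dict[str, float], min_lift: float = 0.02) -> bool:
    """Keep the rule only if at least one dimension improves by min_lift or more."""
    return any(delta >= min_lift for delta in lift.values())


# Illustrative scores for one cell, before and after adding a single harness rule.
baseline = {"correctness": 0.70, "faithfulness": 0.80, "context_robustness": 0.60}
with_rule = {"correctness": 0.76, "faithfulness": 0.80, "context_robustness": 0.61}

lift = dimension_lift(baseline, with_rule)
print(lift, "keep" if keep_rule(lift) else "cut")
```

Changing exactly one rule per run is what makes the lift attributable; two simultaneous changes would need their own ablations.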

05

Claude vs Codex — where does each win?

open

Different runtimes have different strengths on tool-calling, instruction-following, and faithful implementation. We want the honest split, per task tier.

06

What is the best value cell — most quality per dollar — for routine work?

open

The ceiling cell defines 100% but is rarely the right daily driver. Best value is the cell we actually want most agents to run.
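A rough sketch of picking the best value cell: rank cells by quality per dollar, but only among cells that clear a minimum share of ceiling quality. The quality floor is an assumption added for this example (without it, a very cheap but weak cell can win on ratio alone); cell names and numbers are hypothetical.

```python
from typing import Optional


def best_value_cell(
    cells: dict[str, tuple[float, float]],
    ceiling_quality: float,
    min_pct_of_ceiling: float = 0.90,
) -> Optional[str]:
    """cells maps cell name -> (quality, cost_usd).
    Return the cell with the highest quality per dollar among those whose
    quality is at least min_pct_of_ceiling of the ceiling, or None."""
    eligible = {
        name: quality / cost
        for name, (quality, cost) in cells.items()
        if cost > 0 and quality / ceiling_quality >= min_pct_of_ceiling
    }
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


# Illustrative (quality, cost in USD) pairs on the same task suite.
cells = {
    "opus-max": (80.0, 50.0),    # ceiling: 100% of quality, 1.6 quality per dollar
    "sonnet-med": (76.0, 10.0),  # 95% of ceiling, 7.6 quality per dollar
    "haiku-low": (56.0, 2.0),    # 70% of ceiling, 28 quality per dollar
}

print(best_value_cell(cells, ceiling_quality=80.0))
```

With the floor at 90%, the cheapest cell is excluded despite its ratio, which matches the idea that best value is for work we still need done well.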

Decisions made

Score model x harness, not raw capability.
The harness is what we can improve; public benchmarks already cover capability.
Reasonability bands, not full permutation.
Each tier runs a sensible default plus or minus one step. We never run haiku/spark on deep planning, and never run opus-max on scouting (except as the ceiling).
Normalize every score to the frontier ceiling (% of achievable).
This makes a statement like 'sonnet-med hits 95% of opus-max at 18% of cost' the headline finding.
Context-robustness is its own measured dimension.
The dumb zone is too important to fold into a single quality number.
Tune per repo (Omni, Tastymaestro).
The best profile for a dense trading UI differs from the best profile for a clinical app; recommendations are repo-specific.
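The reasonability-band decision above can be pictured as a window over an ordered ladder of cells. This is a sketch only: the ladder ordering, the per-tier defaults, and the cell labels are hypothetical placeholders, not the project's actual configuration.

```python
# Hypothetical ladder of cells, ordered from cheapest to the ceiling.
LADDER = ["haiku-low", "sonnet-low", "sonnet-med", "sonnet-high", "opus-high", "opus-max"]

# Hypothetical default cell per task tier.
TIER_DEFAULTS = {
    "scout": "sonnet-low",
    "build": "sonnet-med",
    "verify": "sonnet-med",
    "plan": "opus-high",
    "debug": "sonnet-high",
}

CEILING = LADDER[-1]  # run separately as the 100% reference for every tier


def band(tier: str, step: int = 1) -> list[str]:
    """Cells to run for a tier: its default plus or minus `step` ladder positions."""
    idx = LADDER.index(TIER_DEFAULTS[tier])
    lo = max(0, idx - step)
    hi = min(len(LADDER) - 1, idx + step)
    return LADDER[lo:hi + 1]


for tier in TIER_DEFAULTS:
    print(f"{tier}: {band(tier)} (ceiling reference: {CEILING})")
```

Under these placeholder defaults, planning never sees the cheapest cell and scouting only meets the top cell as the ceiling reference, while the run count stays well below the full tier-by-cell permutation.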

Updated 2026-06-24 · edit ui/data/mission.json to evolve this