What we're trying to achieve
For any task in Omni or Tastymaestro, know the model + effort + harness rules + guardrails that produce the best result — and prove it with evidence.
We are not benchmarking raw model capability (public benchmarks do that). We are benchmarking model x harness on OUR repos and OUR tasks — because the harness (prompts, gates, guardrails, context management) is the part we control and can improve.
The questions we're answering
What model x effort gives the best result for each task tier?
exploring · Scout, build, verify, plan, and debug are different jobs. The right tool for recon is wrong for architecture. We want the sensible default per tier, not one model for everything.
Can Sonnet — or even Haiku — match Opus with the right harness? Where exactly is the boundary?
open · If a cheaper model hits 95% of frontier quality at 20% of the cost on our tasks, that changes how we spend every day. We deliberately probe a step below the obvious choice to find where it breaks.
Where is each cell's 'dumb zone' — the context size at which it stops being reliable?
open · A model that is sharp at 30k tokens may quietly degrade at 200k. Knowing the breakpoint tells us when to chunk, summarize, or switch to a longer-context cell. It is a first-class score, not an afterthought.
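One way to turn the 'dumb zone' into a number, as a minimal sketch: score the same cell at several context sizes and report the first size where the score drops below a fraction of its small-context score. The function name, the 90% threshold, and the sample scores are all assumptions for illustration, not measured results or part of the existing harness.

```python
def find_breakpoint(scores_by_context, threshold=0.9):
    """Return the smallest context size (in tokens) whose score falls below
    threshold * (score at the smallest context size), or None if none does.

    scores_by_context: dict mapping context size in tokens -> score in [0, 1].
    threshold: hypothetical cutoff; 0.9 means "lost more than 10% of quality".
    """
    sizes = sorted(scores_by_context)
    reference = scores_by_context[sizes[0]]  # score at the smallest context
    for size in sizes:
        if scores_by_context[size] < threshold * reference:
            return size
    return None


# Illustrative numbers only (assumed, not measured).
example_scores = {30_000: 0.92, 60_000: 0.90, 120_000: 0.85, 200_000: 0.70}
breakpoint_tokens = find_breakpoint(example_scores)
print(breakpoint_tokens)
```

A cell with no breakpoint in the tested range returns None, which is itself useful: it says the range needs extending before the cell can be called safe at long context.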
Which harness rules and guardrails actually move scores toward the ceiling?
open · This is the whole point. We baseline, change one rule, re-run, and measure the lift. Rule changes that don't move a dimension get cut.
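The baseline, change one rule, re-run loop can be sketched as a per-dimension comparison of mean scores. The dimension names, the 0.02 minimum lift, and the run data below are hypothetical placeholders; the real scoring dimensions and noise floor would come from the benchmark itself.

```python
from statistics import mean


def lift_per_dimension(baseline_runs, variant_runs):
    """Mean score change per dimension between baseline runs and runs
    with exactly one harness rule changed.

    Each run is a dict mapping dimension name -> score in [0, 1].
    """
    dimensions = baseline_runs[0].keys()
    return {
        dim: mean(run[dim] for run in variant_runs)
        - mean(run[dim] for run in baseline_runs)
        for dim in dimensions
    }


def keep_rule(lifts, min_lift=0.02):
    """Keep the rule only if at least one dimension moved by min_lift or more.
    min_lift is an assumed noise floor, not a calibrated value."""
    return any(delta >= min_lift for delta in lifts.values())


# Illustrative data only: two baseline runs, two runs with one rule changed.
baseline = [
    {"correctness": 0.70, "instruction_following": 0.80},
    {"correctness": 0.74, "instruction_following": 0.80},
]
variant = [
    {"correctness": 0.80, "instruction_following": 0.80},
    {"correctness": 0.78, "instruction_following": 0.81},
]
lifts = lift_per_dimension(baseline, variant)
print(lifts, keep_rule(lifts))
```

Averaging over repeated runs matters here: with a single run per side, ordinary run-to-run variance is easy to mistake for lift.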
Claude vs Codex — where does each win?
open · Different runtimes have different strengths on tool-calling, instruction-following, and faithful implementation. We want the honest split, per task tier.
What is the best value cell — most quality per dollar — for routine work?
open · The ceiling cell defines 100% but is rarely the right daily driver. Best value is the cell we actually want most agents to run.
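A rough sketch of how "best value" could be computed: normalize each cell's score against the ceiling cell, divide by cost per task, and pick the highest ratio among cells that clear a minimum quality bar. The Cell fields, the cell names, the quality floor, and the numbers are assumptions for illustration; they mirror the 95% quality at 20% cost scenario above rather than any real result.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str        # hypothetical label for a model x effort x harness combination
    score: float     # raw benchmark score on our tasks
    cost_usd: float  # average cost per task


def best_value(cells, min_quality=0.0):
    """Return the name of the cell with the most quality per dollar.

    Quality is each cell's score as a fraction of the ceiling (best) score.
    Cells below min_quality are excluded so a cheap but unusable cell cannot win.
    """
    ceiling = max(cell.score for cell in cells)
    ranked = []
    for cell in cells:
        quality = cell.score / ceiling
        if quality >= min_quality:
            ranked.append((quality / cell.cost_usd, cell.name))
    return max(ranked)[1] if ranked else None


# Illustrative numbers only: cell-b reaches about 95% of ceiling at 20% of the cost.
cells = [
    Cell("cell-a", score=0.80, cost_usd=1.00),  # ceiling cell, defines 100%
    Cell("cell-b", score=0.76, cost_usd=0.20),
]
print(best_value(cells, min_quality=0.90))
```

The quality floor is the design choice to argue about: set too low, the cheapest cell always wins; set at 1.0, only the ceiling cell qualifies.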
Decisions made
Updated 2026-06-24 · edit ui/data/mission.json to evolve this