agent-lab · model × harness arena

agent-lab · evaluation platform

GAUNTLET

Run every model and harness through the gauntlet — keep what survives, per task, per repo.

For any task in Omni or Tastymaestro, know the model + effort + harness rules + guardrails that produce the best result — and prove it with evidence.

What we're answering · The plan & progress · Leaderboard

Phases: 0/7 complete

Tasks: 0/27 done

Open questions: 0 of 6

Updated: 2026-06-24

Build phases

The spec + measurement scaffolding the whole platform stands on.

The Gauntlet UI

A branded, live, interactive home Chris can see and steer from — mission, plan, leaderboard, comments.

Cell runner + scorer

Run one agent at one (runtime, model, effort) cell on a task; capture a run-record; score it against the rubric.
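As a rough illustration of the run, capture, score flow described above, here is a minimal sketch. Every name in it (`Cell`, `RunRecord`, `run_cell`, `score`, the stub agent, and the rubric criteria) is hypothetical and not taken from the platform; the rubric is assumed to be a set of named checks that each return a number between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Cell:
    """One (runtime, model, effort) combination. Hypothetical shape."""
    runtime: str
    model: str
    effort: str


@dataclass
class RunRecord:
    """What a single run leaves behind. Field names are assumptions."""
    cell: Cell
    task_id: str
    output: str
    scores: dict[str, float] = field(default_factory=dict)


Agent = Callable[[Cell, str], str]
Rubric = dict[str, Callable[[str], float]]


def run_cell(cell: Cell, task_id: str, agent: Agent) -> RunRecord:
    """Run one agent at one cell on one task and capture the result."""
    return RunRecord(cell=cell, task_id=task_id, output=agent(cell, task_id))


def score(record: RunRecord, rubric: Rubric) -> float:
    """Apply every rubric check to the output; return the unweighted mean."""
    if not rubric:
        raise ValueError("rubric must have at least one criterion")
    record.scores = {name: check(record.output) for name, check in rubric.items()}
    return sum(record.scores.values()) / len(record.scores)


# Stand-in agent so the sketch runs without any real model call.
def stub_agent(cell: Cell, task_id: str) -> str:
    return f"{task_id} handled by {cell.model}"


cell = Cell(runtime="local", model="model-a", effort="high")
record = run_cell(cell, "task-1", stub_agent)
rubric = {
    "mentions_task": lambda out: 1.0 if "task-1" in out else 0.0,
    "non_empty": lambda out: 1.0 if out else 0.0,
}
overall = score(record, rubric)
```

Keeping the per-criterion scores on the record, rather than only the mean, is one way to preserve the evidence behind each leaderboard entry.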

THE milestone: real comparisons populate the leaderboard with evidence — starting with sculptor, ending with the whole core team.
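One way such a leaderboard could be assembled from scored runs is sketched below. The `(task, cell, score)` tuple layout and the `leaderboard` function are assumptions made for illustration; ranking by mean score is one possible choice, not a stated platform rule.

```python
from collections import defaultdict
from statistics import mean


def leaderboard(runs):
    """Group runs by task and rank cells by mean score, best first.

    runs: iterable of (task, cell, score) tuples (hypothetical layout).
    Returns {task: [(cell, mean_score, n_runs), ...]}.
    """
    grouped = defaultdict(list)
    for task, cell, run_score in runs:
        grouped[(task, cell)].append(run_score)

    board = defaultdict(list)
    for (task, cell), scores in grouped.items():
        # Keep the run count next to the mean so thin evidence stays visible.
        board[task].append((cell, mean(scores), len(scores)))

    for rows in board.values():
        rows.sort(key=lambda row: row[1], reverse=True)
    return dict(board)


runs = [
    ("fix-bug", "cell-a", 0.5),
    ("fix-bug", "cell-a", 1.0),
    ("fix-bug", "cell-b", 0.5),
]
board = leaderboard(runs)
```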

Dumb-zone sweep + full rubric

Map where each cell breaks down as context grows; complete subjective scoring.
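A sweep like this could locate the breakdown point by scoring the same cell at increasing context sizes and reporting the first size where the score falls well below the short-context baseline. The sketch below assumes that definition; the function name, the 20% drop threshold, and the sample numbers are all illustrative, not measured results.

```python
def find_breakdown(scores_by_context, drop=0.2):
    """Return the smallest context size where the score falls more than
    `drop` (as a fraction) below the score at the smallest context size,
    or None if no size in the sweep degrades that far.

    scores_by_context: {context_size_in_tokens: score} (hypothetical layout).
    """
    sizes = sorted(scores_by_context)
    if not sizes:
        return None
    baseline = scores_by_context[sizes[0]]
    threshold = baseline * (1 - drop)
    for size in sizes[1:]:
        if scores_by_context[size] < threshold:
            return size
    return None


# Illustrative numbers only.
sweep = {1_000: 0.9, 8_000: 0.88, 32_000: 0.8, 128_000: 0.6}
breakdown_at = find_breakdown(sweep)
```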

Harness-improvement loop

Baseline -> change prompts/gates/guardrails -> re-run -> measure the lift toward the ceiling.
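The "lift toward the ceiling" can be read as the share of the remaining headroom that a harness change recovers. The sketch below assumes that reading, with the ceiling taken as the best score considered achievable on the task; the function name and the formula are an interpretation, not a definition given by the platform.

```python
def lift_toward_ceiling(baseline, after, ceiling):
    """Fraction of the baseline-to-ceiling gap closed by a change.

    0.0 means no improvement, 1.0 means the ceiling was reached,
    and negative values mean the change made things worse.
    """
    headroom = ceiling - baseline
    if headroom <= 0:
        raise ValueError("ceiling must be above the baseline score")
    return (after - baseline) / headroom


# Illustrative numbers only: half of the available headroom recovered.
lift = lift_toward_ceiling(baseline=0.5, after=0.75, ceiling=1.0)
```

Normalising by headroom makes lifts comparable across cells that start from different baselines.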

Cloud + per-repo tuning

Runs execute in the cloud by default (serial/parallel/scheduled); recommendations fine-tuned per repo.