Model + Harness Leaderboard · per workflow

The best model depends on the workflow.

You don't bet against Claude. You route Claude where Claude is needed, and run cheaper stacks where the harness proves they're safe. VibeOps continuously finds the cheapest reliable stack for every engineering decision. The demo numbers below are illustrative; your own benchmark is generated during the private 100-PR replay.

Stack	Recall	FP	$/PR	Latency	Decision	Notes
Raw Claude review (no harness)	74%	31%	$1.20	28.0s	Too noisy	Comments on style without surfacing contract gaps.
Claude + VibeOps Integration Harness	88%	18%	$1.55	31.0s	Best quality	Contract reasoning + retry/idempotency evals.
Kimi-32B + VibeOps Integration Harness	82%	22%	$0.09	11.0s	Sweet spot	About 17× cheaper than Claude + harness, recall within 6 pts. Default route for stable connectors.
SLM only (no harness)	51%	40%	$0.02	4.0s	Unsafe	Misses contract gaps. Not safe alone.
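
As a rough illustration of how rows like these could turn into a routing default, the sketch below picks the cheapest stack that clears a recall floor and a false-positive ceiling. The thresholds (80% recall, 25% FP) and the names `Stack` and `cheapest_reliable` are assumptions made for this example, not VibeOps's actual selection rule; the figures are the illustrative demo numbers from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stack:
    name: str
    recall: float       # "Recall" column, as a fraction
    fp_rate: float      # "FP" column, as a fraction
    cost_per_pr: float  # "$/PR" column, in USD
    latency_s: float    # "Latency" column, in seconds


# Illustrative demo numbers copied from the table above.
STACKS = [
    Stack("Raw Claude review (no harness)", 0.74, 0.31, 1.20, 28.0),
    Stack("Claude + VibeOps Integration Harness", 0.88, 0.18, 1.55, 31.0),
    Stack("Kimi-32B + VibeOps Integration Harness", 0.82, 0.22, 0.09, 11.0),
    Stack("SLM only (no harness)", 0.51, 0.40, 0.02, 4.0),
]


def cheapest_reliable(
    stacks: List[Stack],
    min_recall: float = 0.80,  # assumed reliability floor
    max_fp: float = 0.25,      # assumed noise ceiling
) -> Optional[Stack]:
    """Return the lowest-cost stack that meets both thresholds, or None."""
    eligible = [s for s in stacks if s.recall >= min_recall and s.fp_rate <= max_fp]
    return min(eligible, key=lambda s: s.cost_per_pr, default=None)


best = cheapest_reliable(STACKS)
claude_harness = STACKS[1]

# Compare the chosen stack against the best-quality stack.
cost_ratio = claude_harness.cost_per_pr / best.cost_per_pr
recall_gap_pts = round((claude_harness.recall - best.recall) * 100)

print(f"{best.name}: {cost_ratio:.1f}x cheaper, {recall_gap_pts} pts lower recall")
```

With these thresholds the Kimi-32B stack is selected. Tightening `min_recall` above 0.82 would push the choice to Claude + harness, which is the kind of per-workflow trade-off the leaderboard is meant to expose.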

Cost vs Trust

Higher = better recall. Further left = cheaper. The sweet spot is the upper left.

[Chart: cost ($/PR) on the horizontal axis, recall on the vertical axis. Plotted points: Best quality, Sweet spot, Too noisy, Unsafe alone.]

Routing recommendation

For additive integration changes, VibeOps recommends the following stack mix. Routing is per-decision, not per-PR — different parts of the same review go to different tiers.

Kimi/SLM for basic checks → Claude on contract reasoning → owner on ambiguity
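
The tiering above can be sketched as a small per-decision router. This is a minimal sketch under stated assumptions: the decision labels (`basic_check`, `contract_reasoning`), the `ambiguous` flag, and the `Tier` names are hypothetical, and how VibeOps actually classifies decisions is not described here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tier(Enum):
    CHEAP = "Kimi/SLM"
    CLAUDE = "Claude"
    OWNER = "owner"


@dataclass(frozen=True)
class Decision:
    kind: str                # hypothetical labels: "basic_check", "contract_reasoning"
    ambiguous: bool = False  # hypothetical flag set by an upstream check


def route(decision: Decision) -> Tier:
    """Route one decision to a tier: ambiguity goes to the owner first."""
    if decision.ambiguous:
        return Tier.OWNER
    if decision.kind == "basic_check":
        return Tier.CHEAP
    if decision.kind == "contract_reasoning":
        return Tier.CLAUDE
    # Assumed conservative default: unknown decision kinds go to a human.
    return Tier.OWNER


def route_pr(decisions: List[Decision]) -> List[Tier]:
    """Routing is per decision, so one PR can fan out to several tiers."""
    return [route(d) for d in decisions]


pr = [
    Decision("basic_check"),
    Decision("contract_reasoning"),
    Decision("contract_reasoning", ambiguous=True),
]
tiers = route_pr(pr)
print([t.value for t in tiers])
```

The point of the sketch is that a single review yields a list of tiers rather than one model choice, which is what "per-decision, not per-PR" implies.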

Avg cost reduction vs raw Claude: 84%

Recall delta on goldset: −6 pts (negligible on stable workflows)

Latency improvement: 2.8× (median, p50)

The moat · trust router stays right when models change

Model winners will keep changing. The leaderboard above will look different in 90 days — Kimi might catch Claude on reasoning, Gemini might dominate latency, a new SLM might own contract checks. VibeOps becomes the layer that always knows which model, with which harness, for which workflow — because the harnesses are ours, the historical replay is ours, and the trust accounting is workflow-aware. The model layer is a commodity. The trust router is not.