Model + Harness Leaderboard · per workflow
The best model depends on the workflow.
You don't bet against Claude. You route Claude where Claude is needed, and run cheaper stacks where the harness proves they're safe. VibeOps continuously finds the cheapest reliable stack for every engineering decision. The demo numbers below are illustrative; your own benchmark is generated during the private 100-PR replay.
| Stack | Recall | FP | $/PR | Latency | Decision |
|---|---|---|---|---|---|
| Raw Claude review (no harness): comments on style without surfacing contract gaps. | 74% | 31% | $1.20 | 28.0s | Too noisy |
| Claude + VibeOps Integration Harness: contract reasoning + retry/idempotency evals. | 88% | 18% | $1.55 | 31.0s | Best quality |
| Kimi-32B + VibeOps Integration Harness: 17× cheaper than Claude + harness, recall within 6 pts. Default route for stable connectors. | 82% | 22% | $0.09 | 11.0s | Sweet spot |
| SLM only (no harness): misses contract gaps. Not safe alone. | 51% | 40% | $0.02 | 4.0s | Unsafe |
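One way to read "cheapest reliable stack" is as a filter-then-minimize step over the leaderboard rows. The sketch below is only an illustration of that idea, not VibeOps' actual selection logic: the `Stack` type, the `cheapest_reliable` helper, and the threshold values are all hypothetical, and the figures are the illustrative demo numbers from the table.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Stack:
    name: str
    recall: float       # Recall column, as a fraction
    fp: float           # FP column, as a fraction
    cost_per_pr: float  # $/PR column, in USD
    latency_s: float    # Latency column, in seconds


# Illustrative demo numbers copied from the table above.
STACKS = [
    Stack("Raw Claude review (no harness)", 0.74, 0.31, 1.20, 28.0),
    Stack("Claude + VibeOps Integration Harness", 0.88, 0.18, 1.55, 31.0),
    Stack("Kimi-32B + VibeOps Integration Harness", 0.82, 0.22, 0.09, 11.0),
    Stack("SLM only (no harness)", 0.51, 0.40, 0.02, 4.0),
]


def cheapest_reliable(
    stacks: Iterable[Stack], min_recall: float, max_fp: float
) -> Optional[Stack]:
    """Return the cheapest stack that meets both thresholds, or None."""
    eligible = [s for s in stacks if s.recall >= min_recall and s.fp <= max_fp]
    return min(eligible, key=lambda s: s.cost_per_pr, default=None)


# Hypothetical thresholds; the page does not state what counts as "reliable".
best = cheapest_reliable(STACKS, min_recall=0.80, max_fp=0.25)
print(best.name if best else "no stack qualifies")
```

With these assumed thresholds the Kimi-32B row is selected; raising `min_recall` to 0.85 leaves only the Claude + harness row eligible, which matches the "Best quality" versus "Sweet spot" labels.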
Cost vs Trust
Higher = better recall. Further left = cheaper. The sweet spot is the upper left.
Chart labels: Best quality · Sweet spot · Too noisy · Unsafe alone
Routing recommendation
For additive integration changes, VibeOps recommends the following stack mix. Routing is per-decision, not per-PR: different parts of the same review go to different tiers.
Kimi/SLM for basic checks → Claude on contract reasoning → owner on ambiguity
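The per-decision routing described above can be sketched as a small dispatch function. This is a minimal illustration under stated assumptions: the `Tier` enum, the decision kinds (`"basic_check"`, `"contract_reasoning"`), and the `ambiguous` flag are hypothetical names, not part of any published VibeOps interface.

```python
from enum import Enum
from typing import Dict, List


class Tier(Enum):
    CHEAP = "Kimi/SLM"
    CLAUDE = "Claude"
    OWNER = "owner"


def route(kind: str, ambiguous: bool = False) -> Tier:
    """Pick a tier for one decision inside a review (hypothetical rules)."""
    if ambiguous:
        return Tier.OWNER      # escalate to the human owner on ambiguity
    if kind == "contract_reasoning":
        return Tier.CLAUDE     # contract reasoning goes to Claude
    return Tier.CHEAP          # basic checks stay on the cheap tier


def route_review(decisions: List[dict]) -> Dict[str, Tier]:
    """Route every decision of a single PR review independently."""
    return {
        d["id"]: route(d["kind"], d.get("ambiguous", False))
        for d in decisions
    }


# One PR, three decisions, three different tiers.
plan = route_review([
    {"id": "lint", "kind": "basic_check"},
    {"id": "retry-semantics", "kind": "contract_reasoning"},
    {"id": "schema-change", "kind": "contract_reasoning", "ambiguous": True},
])
for decision_id, tier in plan.items():
    print(f"{decision_id} -> {tier.value}")
```

Keying the result by decision rather than by PR is what lets a single review mix tiers.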
Avg cost reduction vs raw Claude: 84%
Recall delta on goldset: −6 pts (negligible on stable workflows)
Latency improvement: 2.8× (p50)
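These headline figures can be cross-checked against the leaderboard rows. The page does not say which rows each metric compares, so the sketch below makes assumptions. The recall delta and latency ratio happen to match Kimi-32B + harness against Claude + harness. The cost reduction depends on a routing mix the page does not give, so `claude_share=0.07` is a hypothetical value chosen only to show how a blended cost could land near 84%.

```python
# Illustrative demo numbers from the leaderboard table.
CLAUDE_HARNESS = {"recall_pct": 88, "cost": 1.55, "latency_s": 31.0}
KIMI_HARNESS = {"recall_pct": 82, "cost": 0.09, "latency_s": 11.0}
RAW_CLAUDE_COST = 1.20

# Assumed baseline: Claude + harness vs Kimi-32B + harness.
recall_delta = KIMI_HARNESS["recall_pct"] - CLAUDE_HARNESS["recall_pct"]
latency_ratio = CLAUDE_HARNESS["latency_s"] / KIMI_HARNESS["latency_s"]
cost_ratio = CLAUDE_HARNESS["cost"] / KIMI_HARNESS["cost"]


def blended_cost(claude_share: float) -> float:
    """Average $/PR if `claude_share` of spend-weighted work goes to Claude + harness."""
    return (
        claude_share * CLAUDE_HARNESS["cost"]
        + (1 - claude_share) * KIMI_HARNESS["cost"]
    )


def cost_reduction_vs_raw_claude(claude_share: float) -> float:
    """Fractional saving relative to the raw Claude review row."""
    return 1 - blended_cost(claude_share) / RAW_CLAUDE_COST


print(f"recall delta: {recall_delta} pts")
print(f"latency ratio: {latency_ratio:.1f}x")
print(f"cost ratio: {cost_ratio:.1f}x")
# claude_share=0.07 is a hypothetical mix, not a figure from the page.
print(f"cost reduction at 7% Claude: {cost_reduction_vs_raw_claude(0.07):.0%}")
```

Under these assumptions the saving shrinks as more decisions are escalated to Claude, so the real figure would come from the replayed routing mix rather than from the table alone.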
The moat · trust router stays right when models change
Model winners will keep changing. The leaderboard above will look different in 90 days: Kimi might catch Claude on reasoning, Gemini might dominate latency, a new SLM might own contract checks. VibeOps becomes the layer that always knows which model, with which harness, for which workflow, because the harnesses are ours, the historical replay is ours, and the trust accounting is workflow-aware. The model layer is a commodity. The trust router is not.