VibeOps Autonomy Lab
Trust infrastructure for autonomous engineering
Demo · VibeCorp Engineering · public-style PR history
Model + Harness Leaderboard · per workflow

The best model depends on the workflow.

You don't bet against Claude. You route Claude where Claude is needed, and run cheaper stacks where the harness proves they're safe. VibeOps continuously finds the cheapest reliable stack for every engineering decision. Demo numbers below are illustrative; your benchmark gets generated during the private 100-PR replay.

| Stack | Notes | Recall | FP | $/PR | Latency | Decision |
|---|---|---|---|---|---|---|
| Raw Claude review (no harness) | Comments on style without surfacing contract gaps. | 74% | 31% | $1.20 | 28.0s | Too noisy |
| Claude + VibeOps Integration Harness | Contract reasoning + retry/idempotency evals. | 88% | 18% | $1.55 | 31.0s | Best quality |
| Kimi-32B + VibeOps Integration Harness | 17x cheaper, recall within 6pts. Route default for stable connectors. | 82% | 22% | $0.09 | 11.0s | Sweet spot |
| SLM only (no harness) | Misses contract gaps. Not safe alone. | 51% | 40% | $0.02 | 4.0s | Unsafe |
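As a rough illustration of what "cheapest reliable stack" could mean, the sketch below filters the demo figures from the leaderboard above by a recall floor and a false-positive ceiling, then takes the lowest cost per PR. The thresholds (`min_recall=0.80`, `max_fp=0.25`) and the names `Stack` and `cheapest_reliable` are assumptions made for this example; the page does not state how VibeOps actually scores or selects stacks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stack:
    name: str
    recall: float       # fraction of goldset issues caught
    fp_rate: float      # fraction of false positives
    cost_per_pr: float  # dollars per PR
    latency_s: float    # seconds per review


# Illustrative demo numbers copied from the leaderboard.
STACKS = [
    Stack("Raw Claude review (no harness)", 0.74, 0.31, 1.20, 28.0),
    Stack("Claude + VibeOps Integration Harness", 0.88, 0.18, 1.55, 31.0),
    Stack("Kimi-32B + VibeOps Integration Harness", 0.82, 0.22, 0.09, 11.0),
    Stack("SLM only (no harness)", 0.51, 0.40, 0.02, 4.0),
]


def cheapest_reliable(stacks: List[Stack], min_recall: float,
                      max_fp: float) -> Optional[Stack]:
    """Return the lowest-cost stack that meets both thresholds, or None."""
    eligible = [s for s in stacks
                if s.recall >= min_recall and s.fp_rate <= max_fp]
    return min(eligible, key=lambda s: s.cost_per_pr, default=None)


# Hypothetical thresholds; a real deployment would set these per workflow.
pick = cheapest_reliable(STACKS, min_recall=0.80, max_fp=0.25)
best = max(STACKS, key=lambda s: s.recall)

# Compare the pick against the highest-recall stack.
cost_ratio = best.cost_per_pr / pick.cost_per_pr
recall_gap_pts = round((best.recall - pick.recall) * 100)

print(pick.name, f"{cost_ratio:.1f}x cheaper", f"{recall_gap_pts}pts lower recall")
```

With these assumed thresholds the pick is the Kimi-32B harness stack, and the comparison against the highest-recall stack reproduces the "17x cheaper, recall within 6pts" note in the table ($1.55 / $0.09 ≈ 17.2; 88% − 82% = 6 points).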
Cost vs Trust
[Chart: recall (40% to 100%) against cost per PR (log scale, $) for the four stacks: Best quality, Sweet spot, Too noisy, Unsafe alone.]
Higher = better recall. Further left = cheaper. The sweet spot is the upper left.

Routing recommendation

For additive integration changes, VibeOps recommends the following stack mix. Routing is per-decision, not per-PR — different parts of the same review go to different tiers.

Kimi/SLM for basic checks → Claude on contract reasoning → owner on ambiguity
Avg cost reduction vs raw Claude: 84%
Recall delta on goldset: −6pts (negligible on stable workflows)
Latency improvement: 2.8× (median, p50)
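The per-decision routing described above (cheap tier for basic checks, Claude for contract reasoning, owner on ambiguity) can be sketched as a small dispatch function. Everything here is hypothetical: the decision kinds, the `confidence` score, and the `ambiguity_threshold` of 0.6 are placeholders chosen for illustration, not VibeOps's actual interface or policy.

```python
from enum import Enum
from typing import List, Tuple


class Tier(Enum):
    CHEAP = "Kimi/SLM"
    CLAUDE = "Claude"
    OWNER = "owner"


def route(decision_kind: str, confidence: float,
          ambiguity_threshold: float = 0.6) -> Tier:
    """Pick a tier for one decision inside a PR review.

    `confidence` is an assumed 0..1 score for how clear-cut the decision is;
    anything below the threshold is escalated to the human owner.
    """
    if confidence < ambiguity_threshold:
        return Tier.OWNER
    if decision_kind == "contract_reasoning":
        return Tier.CLAUDE
    return Tier.CHEAP


# One PR, several decisions: each part of the review is routed separately.
pr_decisions: List[Tuple[str, float]] = [
    ("basic_check", 0.95),         # e.g. lint-level or schema checks
    ("contract_reasoning", 0.80),  # e.g. retry/idempotency behavior
    ("contract_reasoning", 0.40),  # unclear intent, escalate
]

plan = [route(kind, conf) for kind, conf in pr_decisions]
for (kind, conf), tier in zip(pr_decisions, plan):
    print(f"{kind} (confidence {conf:.2f}) -> {tier.value}")
```

Checking ambiguity first means an unclear decision reaches the owner regardless of its kind, which keeps the expensive model from being spent on questions it cannot settle.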
The moat · trust router stays right when models change
Model winners will keep changing. The leaderboard above will look different in 90 days — Kimi might catch Claude on reasoning, Gemini might dominate latency, a new SLM might own contract checks. VibeOps becomes the layer that always knows which model, with which harness, for which workflow — because the harnesses are ours, the historical replay is ours, and the trust accounting is workflow-aware. The model layer is a commodity. The trust router is not.