VibeOps Autonomy Lab
Trust infrastructure for autonomous engineering
Demo · VibeCorp Engineering · public-style PR history
Request private 100-PR Replay
Senior Judgment Compiler · the moat

Senior judgment → deterministic trust system.

Type a rule the way a staff engineer would say it in standup. VibeOps compiles it into a trust policy, a runtime check, an eval case, a model route, an exception rule, a certificate section, and a historical-replay benchmark. An internal team can write a Claude skill; what it cannot easily write is the system around it.

Senior rule (natural language)
Walk through the canned artifacts. The live compiler runs inside your VPC after onboarding.
Policy
Eval
Check
Route
Cert
Replay
What you get back
One sentence in. Six artifacts out. Every artifact is versioned, audited, and re-run against your last 90 days of history, so you see what would have changed before you ship the rule.
Trust policy: Versioned. Auditable.
Eval case: Permanent in CI.
GitHub check: Pre-merge gate.
Model route: Per-tier.
Cert section: Surfaces in every PR.
Replay benchmark: Backtested on 90d.

Compiled artifacts

Each tab shows one artifact the rule above generates. All six are produced together and stay in sync.

Trust policy
Compiled
policy IntegrationRetryChange:
  applies_when:
    workflow      = integration
    files_touch   = api_client | connector
    diff_contains = retry | timeout | backoff
  require:
    idempotency_test           = true
    timeout_behavior_evidence  = true
    contract_test              = true
  route:
    basic_policy_checks  → slm
    integration_reasoning → frontier_model
    ambiguity             → human_owner (@connectors-team)
  decision:
    all_evidence_present → candidate_for_auto_approval
    else                 → escalate
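The policy above reads as a match step followed by an evidence check. A minimal Python sketch of that logic follows, assuming the three `applies_when` conditions are combined with AND and that evidence arrives as a set of names. `PullRequest`, `applies`, `decide`, and the `not_applicable` result are hypothetical names for illustration, not the VibeOps API.

```python
from dataclasses import dataclass, field

# Hypothetical constants mirroring the policy text; not a real VibeOps API.
TRIGGER_FILES = {"api_client", "connector"}
TRIGGER_TERMS = ("retry", "timeout", "backoff")
REQUIRED_EVIDENCE = (
    "idempotency_test",
    "timeout_behavior_evidence",
    "contract_test",
)


@dataclass
class PullRequest:
    workflow: str
    file_kinds: set                      # e.g. {"connector"}
    diff_text: str
    evidence: set = field(default_factory=set)


def applies(pr):
    """applies_when: assumed to require all three conditions at once."""
    return (
        pr.workflow == "integration"
        and bool(pr.file_kinds & TRIGGER_FILES)
        and any(term in pr.diff_text.lower() for term in TRIGGER_TERMS)
    )


def decide(pr):
    """Return (decision, missing_evidence) for a single PR."""
    if not applies(pr):
        return "not_applicable", []      # assumed label for out-of-scope PRs
    missing = [name for name in REQUIRED_EVIDENCE if name not in pr.evidence]
    if missing:
        return "escalate", missing
    return "candidate_for_auto_approval", []


pr = PullRequest(
    workflow="integration",
    file_kinds={"connector"},
    diff_text="+ retry_count = 3",
    evidence={"idempotency_test"},
)
print(decide(pr))
```

With only the idempotency test present, the sketch escalates and lists the two missing proofs, which is the same state the GitHub check tab shows.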
Eval case
Compiled
eval IntegrationRetryChange.v1:
  goldset:    42 historical PRs (87 retry-touching)
  predicates: idempotency_evidence_detector,
              timeout_behavior_detector,
              contract_test_presence
  scoring:    macro_f1
  baseline:   raw_claude  → recall 0.74, fp 0.31
  target:     claude+pack → recall 0.88, fp 0.18
  failure_modes:
    - "retry without idempotency proof"
    - "timeout unspecified on 5xx storm"
    - "contract test absent on payload schema"
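The eval case names `macro_f1` as its scoring rule and reports recall and `fp` for the baseline and target. The Python sketch below shows one way those numbers could be computed from labeled predictions. It assumes `fp` means false-positive rate (false alarms over clean PRs) and that the macro average runs over the three predicates; the source states neither, and all function names are illustrative.

```python
def counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for two equal-length boolean sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    tp, _, fn, _ = counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def false_positive_rate(y_true, y_pred):
    # Assumption: "fp" in the eval case is false alarms over all clean PRs.
    _, fp, _, tn = counts(y_true, y_pred)
    return fp / (fp + tn) if fp + tn else 0.0


def f1(y_true, y_pred):
    tp, fp, fn, _ = counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_predicate):
    """Unweighted mean of F1 over predicates: {name: (y_true, y_pred)}."""
    scores = [f1(t, p) for t, p in per_predicate.values()]
    return sum(scores) / len(scores)
```

Macro averaging weights each predicate equally, so a detector that rarely fires still counts as much as one that fires on most PRs.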
GitHub check
Compiled
vibeops/trust · integration-retry-change · required
• idempotency_test: detected
• timeout_behavior_evidence: missing
• contract_test: missing
• awaiting @connectors-team approval
Model route
Compiled
1. SLM (Kimi-32B)
   Surface checks: naming, ownership, files-touched policy. $0.003 / PR.
2. Frontier (Claude Opus 4.7)
   Contract reasoning over retry semantics + idempotency contract. $0.42 / PR.
3. Human owner
   Only when contract proof remains ambiguous. ~12% of triggered PRs.
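The route can be read as a lookup from task type to tier, plus simple cost arithmetic. The Python sketch below uses the per-PR prices and the ~12% human rate quoted above. It assumes every triggered PR passes through both the SLM and the frontier tier, and that unknown task types fall back to the human owner; the source confirms neither, and the function names are illustrative.

```python
# Task-to-tier mapping taken from the policy's route block.
ROUTES = {
    "basic_policy_checks": "slm",
    "integration_reasoning": "frontier_model",
    "ambiguity": "human_owner",
}

SLM_COST = 0.003       # dollars per PR, from the page
FRONTIER_COST = 0.42   # dollars per PR, from the page
HUMAN_RATE = 0.12      # share of triggered PRs reaching a human, from the page


def route(task):
    """Pick a tier for a task; unknown tasks go to a human (assumption)."""
    return ROUTES.get(task, "human_owner")


def estimate(triggered_prs):
    """Return (model_cost_dollars, expected_human_reviews).

    Assumes both model tiers run on every triggered PR.
    """
    model_cost = triggered_prs * (SLM_COST + FRONTIER_COST)
    human_reviews = triggered_prs * HUMAN_RATE
    return round(model_cost, 2), round(human_reviews)


print(route("integration_reasoning"))
print(estimate(100))
```

Under those assumptions, 100 triggered PRs cost about $42.30 in model calls and send roughly 12 PRs to a human owner.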
Trust Certificate section
Compiled
Required proof · integration retry change
  • Idempotency test
  • Timeout behavior evidence
  • Contract test
Auto-renders in every certificate where this policy applies.
Replay benchmark · 90 days
Compiled
Before rule: 74% recall · 31% fp
After rule: 88% recall · 18% fp
Net catches: +14 contract gaps surfaced
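A replay of this kind can be sketched as running the old and new checks over the same labeled history and comparing the results. The Python below is a minimal illustration, assuming each historical PR carries a ground-truth label for whether it had a contract gap. The `replay` function, its result keys, and the definition of net catches as the difference in gaps caught are all hypothetical, not the product's actual method.

```python
def replay(history, before, after):
    """Compare two checks over labeled history.

    history: list of (pr, had_gap) pairs, where had_gap is the known outcome.
    before, after: callables taking a pr and returning True when they flag it.
    """
    def score(check):
        flags = [(bool(check(pr)), bool(gap)) for pr, gap in history]
        gaps = sum(1 for _, g in flags if g)
        clean = len(flags) - gaps
        caught = sum(1 for f, g in flags if f and g)
        false_alarms = sum(1 for f, g in flags if f and not g)
        return {
            "caught": caught,
            "recall": caught / gaps if gaps else 0.0,
            "fp_rate": false_alarms / clean if clean else 0.0,
        }

    b, a = score(before), score(after)
    # Assumed definition: extra real gaps the new rule catches.
    return {"before": b, "after": a, "net_catches": a["caught"] - b["caught"]}


history = [
    ({"retry": True, "idempotent": False}, True),
    ({"retry": True, "idempotent": True}, False),
    ({"retry": False, "idempotent": False}, False),
    ({"retry": True, "idempotent": False}, True),
]
result = replay(
    history,
    before=lambda pr: False,
    after=lambda pr: pr["retry"] and not pr["idempotent"],
)
print(result["net_catches"])
```

Scoring both checks against the same history keeps the comparison fair: any change in recall or false alarms comes from the rule, not from a different sample of PRs.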
Failure Atlas · what compiles into Trust Packs
VibeOps has accumulated patterns of how agentic engineering fails. Each pattern feeds the relevant Trust Pack's evals. Private customer data stays private; generalized patterns improve every deployment.
Missing backward compatibility → Contract harness
Fake test coverage (AI-written) → Mutation eval
Missing rollback path → Migration Pack rollback gate
Flag boundary leak → Boundary classifier
Duplicate abstraction → Repo graph + semantic match
Permission boundary ambiguity → AppSec policy harness
Trust Pack Library · the network effect
Each pack ships pre-built policies, evals, model routes, and certificate sections — built from 420+ replayed PRs across customer deployments. Activate one for the workflow that matters most.
Integration Change Pack · 420 PRs · 36 evals · available
Frontend Critical Flow Pack · 312 PRs · 24 evals · available
Migration Pack · 188 PRs · 18 evals · available
Feature Flag Rollout Pack · 154 PRs · 14 evals · available
Auth/RBAC Pack · 142 PRs · 22 evals · available
Core Infra Pack · 96 PRs · 16 evals · preview