VibeOps Research

Verifier red teaming for AI labs and enterprises. We measure where LLM-as-judge fails, then harden it.

Latest

May 2026
Medical AI’s Trusting Trust Problem

Published in collaboration with Biostack. Claude Opus 4.7 misses explicit negations on HealthBench 83% of the time. We drove the judge hack rate from 35% to 1.7% on out-of-distribution biomedical QA while retaining 81% of medical reasoning capability.

Read →

Jun 2026
Coming next.

What we do

Three workstreams:

  • Adversarial probe libraries for LLM judges and reward models. Mutation-resistant. Regenerated quarterly to defend against contamination.
  • Judge hardening via QLoRA SFT. Reproducible end-to-end on one H100 in under an hour. Hardened LoRA adapter delivered.
  • Continuous monitoring. Weekly probe regeneration plus drift dashboards. Catches reintroduced regressions when judge model versions update.

Open infrastructure

biohart

Cross-vendor leaderboard for biomedical-QA judges. 12 production models scored across HealthBench, PubMedQA, BioASQ, and MIMIC. Mutation engine open.

goodhart

Cross-vendor leaderboard for code-RL verifiers. 28 production models scored. Multi-axis reward scoring open.

Work with us

We engage with AI labs and enterprises on verifier red teaming and judge hardening.

hi@vibeops.tech