VibeOps Research

Verifier red teaming for AI labs and enterprises. We measure where LLM-as-judge fails, then harden it.

Latest

May 2026

Published in collaboration with Biostack. Claude Opus 4.7 misses explicit negations on HealthBench 83% of the time. We drove the judge hack rate from 35% to 1.7% on out-of-distribution biomedical QA while retaining 81% of medical reasoning capability.


Jun 2026

Coming next.

What we do

Three workstreams:

· Adversarial probe libraries for LLM judges and reward models. Mutation-resistant. Regenerated quarterly to defend against contamination.
· Judge hardening via QLoRA SFT. Reproducible end-to-end on one H100 in under an hour. Deliverable: a hardened LoRA adapter.
· Continuous monitoring. Weekly probe regeneration plus drift dashboards. Catches regressions that reappear when the judge model version changes.
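
As an illustration only, here is a minimal Python sketch of what a negation probe and a hack-rate measurement could look like. Everything in it is an assumption made for the example: the function names (`negate_claim`, `hack_rate`, `naive_judge`, `strict_judge`), the toy reference answers, and the working definition of hack rate as the share of mutated, now-incorrect answers that a judge still accepts. It is not the mutation engine or the metric definition used in the published work.

```python
import re
from typing import Callable, List

# Toy mutation rules (hypothetical): flip one "is" / "is not" assertion.
# "is not" is tried first so that an existing negation gets removed
# instead of a second one being stacked on top of it.
_FLIPS = [
    (r"\bis not\b", "is"),
    (r"\bis\b", "is not"),
]


def negate_claim(answer: str) -> str:
    """Return the answer with one 'is' / 'is not' flipped, or unchanged."""
    for pattern, replacement in _FLIPS:
        mutated, count = re.subn(pattern, replacement, answer, count=1)
        if count:
            return mutated
    return answer  # nothing to flip, so this answer yields no probe


def hack_rate(judge: Callable[[str, str], bool], references: List[str]) -> float:
    """Share of negated (now incorrect) answers that the judge still accepts.

    Assumed definition for this sketch: accepted probes / total probes.
    """
    probes = [(ref, negate_claim(ref)) for ref in references]
    probes = [(ref, mut) for ref, mut in probes if mut != ref]
    if not probes:
        return 0.0
    accepted = sum(1 for ref, mut in probes if judge(ref, mut))
    return accepted / len(probes)


def naive_judge(reference: str, candidate: str) -> bool:
    """Toy judge: accepts when every reference word appears in the candidate.

    Blind to an inserted 'not', which is the failure mode being probed.
    """
    return set(reference.lower().split()) <= set(candidate.lower().split())


def strict_judge(reference: str, candidate: str) -> bool:
    """Toy judge: accepts only an exact token-for-token match."""
    return reference.lower().split() == candidate.lower().split()


# Illustrative reference answers, invented for this example.
REFERENCES = [
    "Warfarin is contraindicated in pregnancy.",
    "Amoxicillin is not an antiviral.",
]

if __name__ == "__main__":
    print("naive judge hack rate:", hack_rate(naive_judge, REFERENCES))
    print("strict judge hack rate:", hack_rate(strict_judge, REFERENCES))
```

The two toy judges stand in for "before" and "after" hardening: the bag-of-words judge accepts an answer whose meaning was inverted by an added "not", while the exact-match judge rejects both mutations. A real LLM judge would replace these functions with a model call, and a real probe library would need far more mutation types than a single regex flip.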

Open infrastructure

biohart

Cross-vendor leaderboard for biomedical-QA judges. 12 production models scored across HealthBench, PubMedQA, BioASQ, and MIMIC. Mutation engine open.

goodhart

Cross-vendor leaderboard for code-RL verifiers. 28 production models scored. Multi-axis reward scoring open.

Work with us

We engage with AI labs and enterprises on verifier red teaming and judge hardening.

hi@vibeops.tech