May 2026
Medical AI’s Trusting Trust Problem
Published in collaboration with Biostack. Claude Opus 4.7 misses explicit negations on HealthBench 83% of the time. We reduced the judge hack rate from 35% to 1.7% on out-of-distribution biomedical QA while retaining 81% of medical reasoning capability.