Lab News & Updates

February 23, 2026
QEDBench: Auditing LLMs as Mathematical Judges

We introduce QEDBench, a 272-problem benchmark that decouples mathematical proof generation from proof verification to expose systemic limitations in frontier LLM reasoning. Evaluating 5 solvers and 7 LLM judges against 1,000+ hours of expert grading, we find a dangerous Sycophancy Trap and a Discrete-Continuous Reasoning Gap.
