QEDBench: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
As Large Language Models (LLMs) achieve saturation on elementary mathematical benchmarks like GSM8K and MATH, the research frontier has shifted from generation to the reliability of automated evaluation. While models can now solve high-school competition problems, evaluating intricate proofs at the upper-undergraduate to early graduate level remains an open alignment challenge.
Our recent research demonstrates that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap. To quantify it, we are excited to introduce QEDBench, the first large-scale dual-rubric benchmark designed to measure how closely automated judges align with human experts on university-level math proofs.
The QEDBench Framework
Unlike prior work, which implicitly ties evaluation quality to model performance, we explicitly decouple these two axes to provide a systematic audit of university-level proof generation and its evaluation.
QEDBench comprises 272 expert-curated, proof-based mathematical problems spanning 10 core upper-undergraduate/early-graduate disciplines: Analysis, Complex Analysis, Abstract Algebra, Discrete Math, Probability Theory, ODEs, Number Theory, Combinatorics, Algorithms, and Graph Theory.
Our experimental design followed a strict three-stage pipeline to establish a Human Expert Ground Truth:
- Solution Generation: 5 frontier models (o3-deep-research, GPT-5 Pro, Claude Sonnet 4.5, Gemini 3.0 Pro, DeepSeek-Prover-V2) generated proofs.
- Expert Annotation & Rubric Engineering: 48 PhD-level experts evaluated these solutions over more than 1,000 hours. We contrasted evaluations under Course-Specific Rubrics (pedagogical standards) with Expert-Domain Rubrics (research standards).
- Judge Calibration: 7 judge models graded the generated proofs, allowing us to compare AI grading against Human Ground Truth.
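The calibration stage above can be sketched as a direct comparison of judge verdicts against the human ground truth. This is a minimal illustration, not the actual QEDBench scoring code; the pass/fail encoding and function names are assumptions.

```python
# Sketch of the judge-calibration stage: compare an AI judge's
# pass/fail verdicts against human-expert ground truth on the
# same solutions. Booleans encode pass (True) / fail (False).

def agreement_rate(human, judge):
    """Fraction of solutions where the judge matches the human verdict."""
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def false_pass_rate(human, judge):
    """Fraction of human-failed solutions the judge passed anyway
    (the 'grade inflation' direction of the alignment gap)."""
    judge_on_fails = [j for h, j in zip(human, judge) if not h]
    return sum(judge_on_fails) / len(judge_on_fails) if judge_on_fails else 0.0

human = [True, False, False, True, False]
judge = [True, True,  False, True, True]
print(agreement_rate(human, judge))   # 0.6
print(false_pass_rate(human, judge))  # 2 of the 3 human-failed solutions passed
```

Reporting both numbers matters: a judge can show high raw agreement while still passing most of the solutions that humans reject.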
Key Finding 1: The Discrete-Continuous Divide
When evaluating frontier models against our strict human ground truth, we found a sharp divide in performance across mathematical domains.
Performance does not degrade uniformly; rather, it correlates with the necessity of constructive search. In domains like ODEs and Complex Analysis, where proofs often follow a procedural "recipe," frontier models achieve near-saturation (e.g., 100.0% pass rates for Gemini 3.0 Pro and Claude Sonnet 4.5 in ODEs).
However, in domains requiring the construction of novel objects (e.g., bijections or counter-examples), we see catastrophic failure rates. In Combinatorics, Claude Sonnet 4.5 achieved only 27.3%. This suggests that while models excel at retrieving theorem templates, they struggle to search finite state spaces for constructive proofs.
"While Gemini 3.0 Pro achieves state-of-the-art overall performance across most categories (0.91 raw score), other frontier reasoning models like Claude Sonnet 4.5 see their performance significantly degrade in the discrete domains."
Key Finding 2: The Evaluator Alignment Gap
Do off-the-shelf LLM judges perform genuine logical verification in abstract domains, and do their verdicts align with expert human evaluators? No.
We demonstrate that standard LLM judges exhibit a systematic positive bias. Our dual-evaluation matrix reveals that frontier evaluators like Claude Opus 4.5 inflate scores by up to +0.18 in abstract domains. Similarly, Llama 4 Maverick acts as an extreme "grade inflator," passing 90.2% of solutions against a 67.7% human baseline.
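A per-domain inflation delta of the kind quoted above (e.g., +0.18) can be computed as the mean judge score minus the mean human score on the same solutions. The sketch below is illustrative only; the record layout and the sample numbers are assumptions, not QEDBench data.

```python
# Minimal sketch of a per-domain score-inflation metric:
# positive delta = judge grades higher than humans (inflation),
# negative delta = judge grades lower (excess rigidity).
from collections import defaultdict

def inflation_by_domain(records):
    """records: iterable of (domain, human_score, judge_score), scores in [0, 1]."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # human_sum, judge_sum, count
    for domain, h, j in records:
        s = sums[domain]
        s[0] += h; s[1] += j; s[2] += 1
    return {d: round((j_sum - h_sum) / n, 3) for d, (h_sum, j_sum, n) in sums.items()}

records = [  # toy data, not benchmark results
    ("Abstract Algebra", 0.50, 0.70),
    ("Abstract Algebra", 0.60, 0.76),
    ("Complex Analysis", 0.90, 0.85),
]
print(inflation_by_domain(records))
# {'Abstract Algebra': 0.18, 'Complex Analysis': -0.05}
```

Computed per domain, this single statistic surfaces the bidirectional pattern described below: leniency in discrete domains and rigidity in continuous ones.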
We term this phenomenon the Sycophancy Trap: evaluator models systematically reward authoritative-sounding LaTeX formatting and plausible structural setups despite the presence of catastrophic logical flaws. Evaluator models act leniently in Discrete domains, but become punishingly rigid in Continuous domains like Complex Analysis, revealing a bidirectional alignment gap.
Key Finding 3: Rubric Insensitivity and The Knowledge Gap
A common theory suggests that LLM judges fail primarily due to underspecified instructions. We tested this by forcing judges to use the stricter Course-Specific Rubric, which bans advanced machinery without derivation.
We found a phenomenon of Rubric Insensitivity. Despite adding rigid constraints, the alignment metrics of judges like GPT-5.2 Pro remained virtually unchanged. LLM judges are pathologically lenient and systematically fail to enforce negative constraints, ignoring explicit rubric directives.
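To make concrete what "enforcing a negative constraint" means here, consider the kind of check the Course-Specific Rubric asks a judge to perform: flag a proof that invokes banned advanced machinery without derivation. The banned-phrase list and the string matching below are toy stand-ins for the real rubric, which judges applied as natural-language instructions.

```python
# Toy illustration of a rubric's negative constraint: a proof that
# cites banned advanced machinery should fail unless it derives it.
# The banned list here is hypothetical, not QEDBench's actual rubric.

BANNED_MACHINERY = ["Sylow", "Riemann mapping theorem", "Zorn's lemma"]

def rubric_violations(proof_text, banned=BANNED_MACHINERY):
    """Return the banned phrases the proof invokes, if any."""
    text = proof_text.lower()
    return [b for b in banned if b.lower() in text]

proof = "By the Sylow theorems, the group has a normal subgroup of order 5."
print(rubric_violations(proof))  # ['Sylow']
```

A deterministic checker like this trivially enforces the constraint; the finding is that LLM judges, given the same directive in the rubric text, largely fail to apply it.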
Furthermore, solver models exhibit a massive Knowledge Gap. When faced with complex proofs, rather than engaging constructively, models often default to retrieving advanced machinery or hallucinating constraints. Human experts punish this lack of rigor, while AI judges wave it through.
The Path Forward
QEDBench demonstrates that the bottleneck in automated mathematical reasoning has decisively shifted from generation to verification.
Standard "LLM-as-a-Judge" pipelines do not merely miss errors; they actively reward persuasive "hallucinated rigor." As long as AI checks AI using standard context-window prompting, we risk a feedback loop of deceptive alignment. We hope the release of the public QEDBench dataset will accelerate the future of process-based supervision and reliable automated reasoning.