MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng
MaxProof scales competition-level mathematical proof by searching over a population of candidates using verifier-guided refinement.
How can we scale mathematical proof performance by using a population-based, iterative refinement framework that combines generative proof-writing, verifier-guided critique, and tournament-based selection?
Mathematical proofs are brittle; a single logical gap or hand-waved step invalidates the entire argument, making standard generative models prone to reward-hacking and unreliable reasoning. MaxProof addresses this by treating proof generation as an evolutionary search: it maintains a population of candidate proofs, uses a conservative generative verifier to score them, and iteratively repairs flawed candidates using critique-conditioned refinement. With this test-time scaling, the M3 model achieves gold-medal performance on both IMO 2025 and USAMO 2026, significantly outperforming its standalone generation capability.
Paper Primer
The framework hinges on three specialized capabilities—Proof, Verifier, and Fixer Experts—merged into a single model. The core move is a population-level loop: it samples candidates, uses a pessimistic verifier to filter and critique them, and applies dual PATCH/REWRITE operators to evolve the population toward correct proofs.
MaxProof enables gold-medal performance on competition-level mathematics.
The M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both contests.
The verifier is the system's anchor, utilizing a four-layer defense-in-depth pipeline (filtering, normalization, multi-judge scoring, and pessimistic aggregation) to suppress false positives that would otherwise be amplified by reinforcement learning.
Why is a population-level search necessary instead of just sampling more proofs?
Sampling alone relies on the model's initial best@K capability, whereas MaxProof actively improves promising but flawed candidates through critique-conditioned refinement, effectively raising the population ceiling.
How does the system avoid the reward-hacking common in long-horizon RL?
The verifier is designed to be intentionally conservative, using pessimistic aggregation to favor false negatives over false positives, and the training pipeline explicitly filters out groups with low reward variance to prevent the model from learning arbitrary score differences.
For complex reasoning tasks, performance gains are increasingly found in inference-time search and iterative refinement rather than just scaling base model parameters.
Introduction to MaxProof
Introducing MaxProof, a population-based test‑time scaling approach to overcome single‑pass proof generation limits.
Mathematical proof demands strict logical consistency, unlike open‑ended generation where occasional hand‑waving is tolerated. A single‑pass generation process cannot reliably satisfy the long, tightly coupled chain of constraints required for competition‑level proofs.
To overcome this bottleneck, MaxProof treats proof writing as an iterative, population‑based search: candidates are critiqued, repaired, and ranked until a correct solution emerges.
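The critique, repair, and rank loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. The callables `generate`, `verify`, `patch`, and `rewrite` are hypothetical stand-ins for the model's proof, verifier, and fixer roles. The rule for choosing PATCH versus REWRITE (`patch_threshold`), the population size, the round count, and the use of a generic size-2 tournament with elitism are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple

# A scored candidate: (proof text, verifier score, verifier critique).
Scored = Tuple[str, int, str]


def tournament_select(pop: List[Scored], n_keep: int, k: int,
                      rng: random.Random) -> List[Scored]:
    """Generic tournament selection: draw k candidates, keep the best, repeat."""
    survivors = []
    for _ in range(n_keep):
        contenders = rng.sample(pop, min(k, len(pop)))
        survivors.append(max(contenders, key=lambda c: c[1]))
    return survivors


def population_search(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[int, str]],
    patch: Callable[[str, str, str], str],
    rewrite: Callable[[str, str, str], str],
    pop_size: int = 8,        # assumed value
    rounds: int = 4,          # assumed value
    max_score: int = 7,
    patch_threshold: int = 4,  # assumed rule: patch near-misses, rewrite the rest
    seed: int = 0,
) -> Scored:
    rng = random.Random(seed)

    # Initial population: sample candidates and score each one.
    population: List[Scored] = []
    for _ in range(pop_size):
        proof = generate(problem)
        score, critique = verify(problem, proof)
        population.append((proof, score, critique))

    for _ in range(rounds):
        best = max(population, key=lambda c: c[1])
        if best[1] >= max_score:
            return best  # a candidate already receives full marks

        parents = tournament_select(population, pop_size, 2, rng)
        children: List[Scored] = []
        for proof, score, critique in parents:
            # Critique-conditioned refinement of each selected parent.
            operator = patch if score >= patch_threshold else rewrite
            new_proof = operator(problem, proof, critique)
            new_score, new_critique = verify(problem, new_proof)
            children.append((new_proof, new_score, new_critique))

        # Elitism: carry the current best forward unchanged.
        population = children + [best]

    return max(population, key=lambda c: c[1])
```

Passing the roles in as callables keeps the sketch independent of any particular model API; in practice all of them would be served by the same merged model.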
The MiniMax‑M3 (M3) series first trains three proof‑oriented capabilities—proof generation, proof verification, and critique‑conditioned repair—using a defense‑in‑depth generative verifier that minimizes false‑positive feedback.
Despite rapid progress from systems such as AlphaGeometry, AlphaProof, Gemini Deep Thinking, OpenAI’s frontier models, DeepSeek‑Math‑V2, SU‑01, NVIDIA Nemotron Cascade2, and GPT‑5.5, competition‑level proof still requires sharper design choices because a proof must be error‑free.
Our pipeline employs a four‑layer generative‑verifier architecture (bad‑case filtering, solution normalization, multi‑judge parallel scoring, pessimistic min aggregation) whose primary goal is a low false‑positive rate rather than raw accuracy. The M2 cycle taught us that a single‑judge verifier quickly leads to reward‑hacking plateaus; we therefore harden the system against length bias, format hacking, semantic shortcuts, and judge‑specific preferences.
**Figure 1.** The MaxProof pipeline. M3 first trains three proof-oriented capabilities—proof generation through verifier-guided proof RL, proof verification through aligned error finding, and critique-conditioned proof repair through refinement augmentation. These capabilities are merged into the M3 release model, which MaxProof scales at test time through population search and tournament selection.
With MaxProof, the same M3 model attains 35/42 on IMO 2025 and 36/42 on USAMO 2026, surpassing the human gold‑medal threshold and demonstrating that population‑level test‑time scaling can close the gap left by single‑pass generation.
The key insight is that shifting from single‑pass generation to iterative, population‑based refinement enables reliable, competition‑level proof synthesis.
The Proof Expert
We equip the model with a frozen verifier that turns proof drafts into a reliable RL reward.
Long‑horizon proof generation needs a reward that reflects mathematical correctness, but proofs lack an executable ground truth. A naïve verifier that only checks final answers would let the policy exploit superficial cues.
The Proof Expert couples a frozen, multi‑layer verifier with a clipped‑policy RL loop to turn whole‑proof drafts into a scalar reward.
Layer 1 discards drafts that are empty or exceed a length budget; this draft passes.
Layer 2 removes the fixed opening “Assume” and the redundant final line, yielding the core statement “$a^2=b^2$”.
Layer 3 runs three judges: two rubric judges find the equality matches the target rubric, the no‑rubric judge flags the missing justification and assigns score 2.
Layer 4 takes the minimum score across judges, producing a final reward of 2 (out of 7).
The group-level advantage $A_i$ is computed by subtracting the group mean reward and dividing by the group standard deviation; the update proceeds only if the group's reward standard deviation exceeds $\tau_{\text{std}}$, so groups with too little reward variation are filtered out.
The min‑aggregation prevents a single generous judge from inflating the reward, while the early filters keep the RL signal focused on mathematically meaningful content.
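The group-level advantage and the variance gate can be illustrated with a short sketch. It assumes the population standard deviation, a hypothetical threshold `tau_std = 0.1`, and a small `eps` for numerical safety; none of these values come from the source.

```python
from statistics import mean, pstdev
from typing import List, Optional


def group_advantages(rewards: List[float],
                     tau_std: float = 0.1,   # hypothetical threshold
                     eps: float = 1e-6) -> Optional[List[float]]:
    """Return normalized advantages, or None if the group is filtered out."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact estimator is an assumption
    if sigma <= tau_std:
        # Nearly identical rewards carry no usable learning signal.
        return None
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every draft receives the same reward is skipped.
print(group_advantages([2.0, 2.0, 2.0, 2.0]))  # None
```

For a mixed group such as `[0, 7, 2, 7]`, drafts scoring above the group mean receive positive advantages and those below it receive negative ones.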
Why does the Proof Expert use three parallel judges instead of a single, stronger verifier?
Because a single judge can be gamed by learning its idiosyncratic preferences; multiple judges expose disagreements, and the pessimistic min forces the policy to satisfy the strictest judge, reducing reward‑hacking.
**Figure 2.** Training dynamics of the Proof Expert.
**Figure 3.** The verifier pipeline as four defensive layers. The first two layers remove format-driven failure modes; the last two produce a conservative scalar reward.
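The four layers can be sketched as a small function. This is a simplified illustration under stated assumptions: the length budget `MAX_CHARS` is a made-up value, `normalize` is a whitespace-only stand-in for the real solution normalization, judges are modeled as plain callables returning a 0 to 7 score, and they are called sequentially here rather than in parallel. Returning `None` for a filtered draft is also an assumption, since the source does not say how discarded drafts are rewarded.

```python
from typing import Callable, List, Optional

Judge = Callable[[str], int]  # hypothetical interface: proof text -> score in 0..7

MAX_CHARS = 20_000  # hypothetical length budget


def passes_bad_case_filter(draft: str) -> bool:
    """Layer 1: reject empty drafts and drafts over the length budget."""
    return 0 < len(draft.strip()) <= MAX_CHARS


def normalize(draft: str) -> str:
    """Layer 2 (stand-in): collapse whitespace so formatting cannot sway judges."""
    return " ".join(draft.split())


def verifier_reward(draft: str, judges: List[Judge]) -> Optional[int]:
    if not passes_bad_case_filter(draft):
        return None  # filtered out before any judge sees it
    cleaned = normalize(draft)
    scores = [judge(cleaned) for judge in judges]  # Layer 3: multi-judge scoring
    return min(scores)                             # Layer 4: pessimistic aggregation


# Two generous judges and one strict judge: the strict score wins.
toy_judges = [lambda p: 7, lambda p: 7, lambda p: 2]
print(verifier_reward("toy draft", toy_judges))  # 2
```

Taking the minimum means a draft earns a high reward only when every judge agrees, which trades some false negatives for fewer false positives.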
The Verifier Expert
Fine‑grained proof verification replaces coarse score prediction for reliable reasoning.
Predicting a single 0–7 score ignores where a proof fails; a model can cheat by learning surface correlations without truly reading the argument.
Think of a spell-checker that underlines every misspelled word instead of giving an overall “readability” grade: in the same way, the Verifier Expert forces the model to point out each logical flaw.
The model writes
It populates
Because an error is listed, the
The structured report forces the model to read each line; a regression model could never be penalized for missing the specific mistake.
How does the Verifier Expert differ from a simple 0–7 score predictor?
Score prediction only outputs a scalar and can succeed by memorizing surface patterns. The Verifier Expert must enumerate every error and tie the final verdict to that enumeration, which requires genuine comprehension of the proof’s logical flow.
Training data are harvested from the Proof Expert’s own verifier runs: every candidate proof already carries an // triple, which we reuse verbatim.
The harvested set is dominated by “`no_errors`” and “`has_errors`” (together ≈65% of examples); the remaining “`minor_gaps`” and “`fundamentally_wrong`” classes are under-represented, so we rebalance them to prevent collapse onto the two dominant classes.
$R_{\text{error}}$ is computed by a frontier LLM judge that checks both spatial localization (the correct step) and semantic description (the right failure type).
$R_{\text{verdict}}$ assigns reward 1 for exact class match, 0.5 for a one‑step distance (e.g., “`minor_gaps`” vs “`has_errors`”), and 0 otherwise, reflecting the natural ordering `no_errors` < `minor_gaps` < `has_errors` < `fundamentally_wrong`.
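The ordinal verdict reward can be written as a short function. The sketch below encodes the stated ordering as list positions; the function and constant names are illustrative, not taken from the source.

```python
VERDICT_ORDER = ["no_errors", "minor_gaps", "has_errors", "fundamentally_wrong"]


def verdict_reward(predicted: str, reference: str) -> float:
    """1.0 for an exact match, 0.5 for adjacent classes, 0.0 otherwise."""
    distance = abs(VERDICT_ORDER.index(predicted) - VERDICT_ORDER.index(reference))
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.5
    return 0.0


print(verdict_reward("minor_gaps", "has_errors"))  # 0.5
```

Partial credit for adjacent classes gives the model a smoother signal than exact-match scoring while still penalizing verdicts that are far from the reference.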