(Upper Level Ballroom 6A, San Diego Convention Center, December 6, 2025)

Accepted Papers

  1. Tales from a Graph: a Pipeline for Mathematical Problem Generation
  2. Meta Thinker: Thinking What AI Thinks
  3. SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
  4. Probabilistic Soundness Guarantees in LLM Reasoning Chains
  5. I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
  6. Axiom-Aware FunSearch for Non-Constructive Mathematics
  7. CauSciBench: Assessing LLM Causal Reasoning for Scientific Research
  8. PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
  9. Aryabhata: An exam-focused language model for JEE Math
  10. Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
  11. Curiosity-driven RL for symbolic equation solving
  12. Decompose, Adapt, and Evolve: Towards Efficient Scientific Equation Discovery with Large Language Models
  13. Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
  14. Infinite-Dimensional HiPPO Provides an Explicit Formula for LSSLs
  15. MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles
  16. ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
  17. Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
  18. Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
  19. Towards Scaling Laws for Symbolic Regression
  20. Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
  21. Minif2f in Rocq: Automatic Translation Between Proof Assistants — A Case Study
  22. OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
  23. EchoRL: Learning to Plan through Experience for Efficient Reinforcement Learning
  24. Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
  25. CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
  26. DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
  27. RADAR: Reasoning–Ability and Difficulty-Aware Routing for Reasoning LLMs
  28. Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
  29. TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
  30. Decoupling Reasoning from Proving: A New Framework for Tackling Olympiad-Level Mathematics
  31. Analytical Lyapunov Function Discovery: An RL-based Generative Approach
  32. You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
  33. PVSGym: A Proof Learning Environment
  34. ProxyThinker: Test-Time Guidance through Small Visual Reasoners
  35. Nested Depth Generalization in Transformers
  36. Learning Modular Exponentiation with Transformers
  37. Gambit: Generating Automated Mathematical Bounds, Inequalities, and Theorems
  38. Specifying exact circuit algorithms in universal transformers
  39. Tool-Assisted Multi-Turn Theorem Proving with LLMs
  40. A Small Math Model: Recasting Strategy Choice Theory in an LLM-Inspired Architecture
  41. Solving Inequality Proofs with Large Language Models
  42. Beyond Accuracy: Evaluating Multimodal Mathematical and Scientific Reasoning Through Error Analysis and Self-Correction
  43. Process-Verified Reinforcement Learning for Theorem Proving via Lean
  44. LLM-Generated Search Heuristics Can Solve Open Instances of Combinatorial Design Problems
  45. Expanding the Action Space of LLMs to Reason Beyond Language
  46. Single-stream Policy Optimization
  47. SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
  48. CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
  49. Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework
  50. Why Reinforcement Learning Struggles with Expression Simplification: A Reward Analysis
  51. AI Impact on Human Proof Formalization Workflows
  52. One Token to Fool LLM-as-a-Judge
  53. Modeling Chain-of-Thought Collapse in Pruned Language Models: Fidelity and Similarity Analysis for Mathematical Reasoning
  54. LeanDojo-v2: A Comprehensive Library for AI-Assisted Theorem Proving in Lean
  55. From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
  56. Learning Permuted Congruential Sequences with Transformers
  57. In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
  58. SciML Agents: Write the Solver, Not the Solution
  59. Concept Generalization in Humans and Large Language Models: Insights from the Number Game
  60. Hilbert: Recursively Building Formal Proofs with Informal Reasoning
  61. Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models
  62. FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory
  63. PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
  64. Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions
  65. AI-Driven Mathematical Discovery for the Andrews–Curtis Conjecture
  66. DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
  67. ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
  68. A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
  69. Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
  70. On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
  71. R-Zero: Self-Evolving Reasoning LLM from Zero Data
  72. Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
  73. Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors
  74. RLVR vs. Distillation: Understanding Accuracy and Capability in LLM Mathematical Reasoning
  75. Combining Textual and Structural Information for Premise Selection in Lean
  76. Improving autoformalization via cycle consistency and incremental type-checking using language-model probabilistic programs
  77. Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
  78. Adaptive Control for Test-time Scaling
  79. Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
  80. AntiderivBench: Evaluating language models on indefinite integration
  81. Improving ML attacks on LWE with data repetition and stepwise regression
  82. Credit Cards, Confusion, Computation, and Consequences: How Well Do LLMs Reason About Financial Literacy?
  83. Kimina Lean Server: A High-Performance Lean Server for Large-Scale Verification
  84. Restructuring the Corpus Makes RAG Work for Math
  85. Numbers Already Carry Their Own Embeddings
  86. Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
  87. OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
  88. Stoic Reasoner: Dual-Mode Transformers that Compress to Think and Decompress to Speak
  89. FoCus: Improving Faithfulness in Chain-of-Thoughts by Training on Structured Reasoning Data
  90. DELTA: How Does RL Unlock and Transfer New Algorithms in LLMs?
  91. Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
  92. MathBode: Understanding LLM Reasoning with Dynamical Systems
  93. STAT: Skill-Targeted Adaptive Training
  94. SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
  95. Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning
  96. Babel-formal: Translation of Proofs between Lean and Rocq
  97. HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
  98. Usefulness-Driven Learning of Formal Mathematics
  99. Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
  100. Bridging Vision, Language, and Mathematics: Pictographic Character Reconstruction with Bézier Curves
  101. Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
  102. VeriBench-FTP: A Formal Theorem Proving Benchmark in Lean 4 for Code Verification
  103. STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
  104. HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning
  105. StreetMath: Study of LLMs’ Approximation Behaviors
  106. Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT?
  107. Inpainting-Guided Policy Optimization for Diffusion Large Language Models
  108. How does RL induce skill composition? A Case Study using Countdown
  109. Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer
  110. CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
  111. Pretraining Scaling Laws for Generative Evaluations of Language Models
  112. RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval
  113. Reliable Fine-Grained Evaluation of Natural Language Math Proofs
  114. SPG: Sandwiched Policy Gradient for Mask Diffusion Language Models
  115. On the Evolution of Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
  116. Measuring Off-Trajectory Math Reasoning of LLMs
  117. AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
  118. Towards Understanding Self-play for LLM Reasoning
  119. Systematic Diagnosis of Brittle Reasoning in Large Language Models
  120. Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
  121. IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
  122. Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
  123. Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
  124. Automated Discovery of Conservation Laws via Hybrid Neural ODE-Transformers
  125. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
  126. BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs
  127. Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning
  128. Understanding Tool-Integrated Reasoning
  129. CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for Language Models
  130. ProofGym: Unifying LLM-Based Theorem Proving Across Formal Systems
  131. Unspoken Logic: Understanding and bridging the gap between free-form and LLM-interpretable natural language mathematical proofs
  132. Evaluating Spatial Reasoning in Language Models
  133. Faults in our Formal Benchmarks
  134. CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs
  135. In Good GRACES: Principled Teacher Selection for Knowledge Distillation
  136. Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients
  137. Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training
  138. DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
  139. ARM: Discovering Agentic Reasoning Modules for Mathematical Problem-Solving
  140. Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation
  141. Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
  142. Reinforcement Learning for Hierarchical Proof Generation in Lean 4
  143. Exact Learning of Arithmetic with Differentiable Agents
  144. Think, Align, Select: Query–Key Scores for LLM Reasoning
  145. FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
  146. Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
  147. Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
  148. A NUMA Aware Compiler Framework for Large Scale Mathematical Reasoning Inference on PCIe Based Multi Accelerator Systems
  149. Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
  150. Learning to Reason on Hard Problems with Privileged On-Policy Exploration
  151. Patching Gaps In LLM Reasoning With Interventional Training
  152. A Toolbox, Not a Hammer — Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation
  153. RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

The full list of accepted papers can be found on OpenReview.

Reviewers

We are grateful to our fantastic reviewers for making our workshop reviewing process run smoothly: