(Upper Level Ballroom 6A, San Diego Convention Center, December 6, 2025)
Accepted Papers
- Tales from a Graph: a Pipeline for Mathematical Problem Generation
- Meta Thinker: Thinking What AI Thinks
- SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
- Probabilistic Soundness Guarantees in LLM Reasoning Chains
- I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
- Axiom-Aware FunSearch for Non-Constructive Mathematics
- CauSciBench: Assessing LLM Causal Reasoning for Scientific Research
- PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
- Aryabhata: An exam-focused language model for JEE Math
- Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
- Curiosity-driven RL for symbolic equation solving
- Decompose, Adapt, and Evolve: Towards Efficient Scientific Equation Discovery with Large Language Models
- Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
- Infinite-Dimensional HiPPO Provides an Explicit Formula for LSSLs
- MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles
- ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
- Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
- Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
- Towards Scaling Laws for Symbolic Regression
- Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
- MiniF2F in Rocq: Automatic Translation Between Proof Assistants — A Case Study
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
- EchoRL: Learning to Plan through Experience for Efficient Reinforcement Learning
- Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
- DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
- RADAR: Reasoning–Ability and Difficulty-Aware Routing for Reasoning LLMs
- Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
- TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
- Decoupling Reasoning from Proving: A New Framework for Tackling Olympiad-Level Mathematics
- Analytical Lyapunov Function Discovery: An RL-based Generative Approach
- You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
- PVSGym: A Proof Learning Environment
- ProxyThinker: Test-Time Guidance through Small Visual Reasoners
- Nested Depth Generalization in Transformers
- Learning Modular Exponentiation with Transformers
- Gambit: Generating Automated Mathematical Bounds, Inequalities, and Theorems
- Specifying exact circuit algorithms in universal transformers
- Tool-Assisted Multi-Turn Theorem Proving with LLMs
- A Small Math Model: Recasting Strategy Choice Theory in an LLM-Inspired Architecture
- Solving Inequality Proofs with Large Language Models
- Beyond Accuracy: Evaluating Multimodal Mathematical and Scientific Reasoning Through Error Analysis and Self-Correction
- Process-Verified Reinforcement Learning for Theorem Proving via Lean
- LLM-Generated Search Heuristics Can Solve Open Instances of Combinatorial Design Problems
- Expanding the Action Space of LLMs to Reason Beyond Language
- Single-stream Policy Optimization
- SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
- CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
- Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework
- Why Reinforcement Learning Struggles with Expression Simplification: A Reward Analysis
- AI Impact on Human Proof Formalization Workflows
- One Token to Fool LLM-as-a-Judge
- Modeling Chain-of-Thought Collapse in Pruned Language Models: Fidelity and Similarity Analysis for Mathematical Reasoning
- LeanDojo-v2: A Comprehensive Library for AI-Assisted Theorem Proving in Lean
- From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
- Learning Permuted Congruential Sequences with Transformers
- In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
- SciML Agents: Write the Solver, Not the Solution
- Concept Generalization in Humans and Large Language Models: Insights from the Number Game
- Hilbert: Recursively Building Formal Proofs with Informal Reasoning
- Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models
- FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory
- PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions
- AI-Driven Mathematical Discovery for the Andrews–Curtis Conjecture
- DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
- ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
- A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
- Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
- R-Zero: Self-Evolving Reasoning LLM from Zero Data
- Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
- Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors
- RLVR vs. Distillation: Understanding Accuracy and Capability in LLM Mathematical Reasoning
- Combining Textual and Structural Information for Premise Selection in Lean
- Improving autoformalization via cycle consistency and incremental type-checking using language-model probabilistic programs
- Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
- Adaptive Control for Test-time Scaling
- Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
- AntiderivBench: Evaluating language models on indefinite integration
- Improving ML attacks on LWE with data repetition and stepwise regression
- Credit Cards, Confusion, Computation, and Consequences: How Well Do LLMs Reason About Financial Literacy?
- Kimina Lean Server: A High-Performance Lean Server for Large-Scale Verification
- Restructuring the Corpus Makes RAG Work for Math
- Numbers Already Carry Their Own Embeddings
- Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
- OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
- Stoic Reasoner: Dual-Mode Transformers that Compress to Think and Decompress to Speak
- FoCus: Improving Faithfulness in Chain-of-Thoughts by Training on Structured Reasoning Data
- DELTA: How Does RL Unlock and Transfer New Algorithms in LLMs?
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
- MathBode: Understanding LLM Reasoning with Dynamical Systems
- STAT: Skill-Targeted Adaptive Training
- SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
- Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning
- Babel-formal: Translation of Proofs between Lean and Rocq
- HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
- Usefulness-Driven Learning of Formal Mathematics
- Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
- Bridging Vision, Language, and Mathematics: Pictographic Character Reconstruction with Bézier Curves
- Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
- VeriBench-FTP: A Formal Theorem Proving Benchmark in Lean 4 for Code Verification
- STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
- HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning
- StreetMath: Study of LLMs’ Approximation Behaviors
- Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT?
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
- How does RL induce skill composition? A Case Study using Countdown
- Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer
- CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
- Pretraining Scaling Laws for Generative Evaluations of Language Models
- RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval
- Reliable Fine-Grained Evaluation of Natural Language Math Proofs
- SPG: Sandwiched Policy Gradient for Mask Diffusion Language Models
- On the Evolution of Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
- Measuring Off-Trajectory Math Reasoning of LLMs
- AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
- Towards Understanding Self-play for LLM Reasoning
- Systematic Diagnosis of Brittle Reasoning in Large Language Models
- Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
- IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
- Automated Discovery of Conservation Laws via Hybrid Neural ODE-Transformers
- MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
- BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs
- Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning
- Understanding Tool-Integrated Reasoning
- CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for Language Models
- ProofGym: Unifying LLM-Based Theorem Proving Across Formal Systems
- Unspoken Logic: Understanding and bridging the gap between free-form and LLM-interpretable natural language mathematical proofs
- Evaluating Spatial Reasoning in Language Models
- Faults in our Formal Benchmarks
- CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs
- In Good GRACES: Principled Teacher Selection for Knowledge Distillation
- Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients
- Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training
- DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
- ARM: Discovering Agentic Reasoning Modules for Mathematical Problem-Solving
- Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation
- Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
- Reinforcement Learning for Hierarchical Proof Generation in Lean 4
- Exact Learning of Arithmetic with Differentiable Agents
- Think, Align, Select: Query–Key Scores for LLM Reasoning
- FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- A NUMA Aware Compiler Framework for Large Scale Mathematical Reasoning Inference on PCIe Based Multi Accelerator Systems
- Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
- Learning to Reason on Hard Problems with Privileged On-Policy Exploration
- Patching Gaps In LLM Reasoning With Interventional Training
- A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation
- RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
The list of accepted papers can be found on OpenReview here.
Reviewers
We are grateful to our fantastic reviewers for making our workshop reviewing process run smoothly:
- Sumanth Varambally
- Haolin Liu
- George Tsoukalas
- Songlin Jiang
- Prahitha Movva
- Amitayush Thakur
- Zhong Chen
- Katherine M. Collins
- Pawan Sasanka Ammanamanchi
- Xiantao Zhang
- Shange Tang
- Anjiang Wei
- Simon Park
- Shashank Kirtania
- Hanning Zhang
- Nan Cheng
- Jiacheng Chen
- Hansem Sohn
- Sahil Khose
- Chenlu Ye
- Cedegao E. Zhang
- Kevin Scaria
- Peiyang Song
- Haoran Wang
- Feng Chen
- Roozbeh Yousefzadeh
- Yingru Li
- Jiyao Zhang
- Zhenshuo Zhang
- Yiquan Wang
- Chengshuai Shi
- Liao Zhang
- Gabriel Poesia
- Shubhra Mishra
- Wei Xiong
- Peter Chen
- Thomas Zhu
- Parsa Mirtaheri
- Haohan Zou
- Bartosz Piotrowski
- Emily Wenger
- Robert Joseph George
- Param Biyani
- Yifan Zhang
- Kaiyu Yang
- Lukas Münzel
- Yuan Sui
- Zhengying Liu
- Apurv Verma
- Ken Jiankun Zheng
- Pouria Mahdavinia
- Huanmi Tan
- Emile R Richard
- Sophia Simeng Han
- Yixiao Lin
- Coco Huang
- Tomáš Pevný
- Rui Min
- Jannis Limperg
- Shrishant Hattarki
- Alireza Farhadi
- Cyril Cohen
- Helen Jin
- Minju Gwak
- Shashank Sharma
- Hao Tang
- Somshubhra Roy
- Yang He
- Anirban Majumder
- Chiung-Yi Tseng
- Huaihai Lyu
- Sharv Murgai
- Bo Jin
- Kanan Gupta
- Charles L. Wang
- Tao Tao
- Yifeng Liu
- Disha Sheshanarayana
- Stefan Kuhlmann
- Yichen Huang
- Kamesh R
- Markus Norman Rabe
- Nigel Fernandez
- Pratik Prabhanjan Brahma
- Haoze Wu
- Shi Qiu
- Yujun Zhou
- Xiao-Wen Yang
- Wei Du
- Shreyas Krishnan
- Linfeng Song
- Shubhashis Roy Dipta
- Nathan Egbuna
- Hristo Papazov
- Bo Han
- Xuan Li
- Jaywon Koo
- Lyudmila Cheldieva
- Zakhar Kogan
- Arnav Hiray
- Binxin Gao
- Miroslav Marinov
- Qinghua Ding
- Caleb Lu
- Md Tanvirul Alam
- Fadoua Khmaissia
- Haocheng Ju
- Romir Patel
- Zilin Xiao
- Chuhan Li
- Rodrigo Mendoza Smith
- Feiyang Kang
- Giacomo Camposampiero
- Heiko Röglin
- Sewon Min
- Siddharth Mohapatra
- Yuxuan Jiang
- Anton Xue
- Steven Kelk
- Yuheng Wu
- Jasper Dekoninck
- Yongin Kwon
- Tu Nguyen
- Xuying Li
- Xinmu Ge
- Crystal Su
- Akiyoshi Sannai
- Jiajie Li
- Ryan Hsiang
- Johannes Schmitt
- Zizhuo Zhang
- Weizhe Chen
- Ralph Abboud
- Srivatsava Daruru
- Anubhav Shrestha
- Jannik Brinkmann
- Gergely Berczi
- Taneesh Gupta
- Fedor Levkovich-Maslyuk
- Ashish Tiwari
- Prakhar Dixit
- Lei Ding
- Roman Bresson
- Jiaming Cui
- Jasraj Budigam
- Se-Young Yun
- Meijia Chen
- Chenwei Xu
- Qian Xie
- Weston Kirk
- Lucas Paul Saldyt
- Chen Li
- Dongyoon Han
- Dylan Possamaï
- David Holmes
- Caitlyn H. Yin
- Rylan Schaeffer
- Yiyou Sun
- Dmytro Fedoriaka
- Huaibo Chen
- Hao Liang
- Jonas Bayer
- Guan-Yi Lin
- Marek Suppa
- Aochong Oliver Li
- Heng Lin
- Aashwin Ananda Mishra
- Ivo Petrov
- Junyu Zhang
- Zongxia Li
- Andrew Kim
- Nina Narodytska
- Tianyu Liu
- Benjamin Zelenskiy
- Yuan Xia
- Tasha Kim
- Saanidhya Vats
- Christopher D. Rosin
- Saibilila Abudukelimu
- Hamed Mahdavi
- Cheng Ge
- Tianyi Xu
- Kazumi Kasaura
- Sho Sonoda
- Siddharth Bhat
- Blessing Effiong
- Justin Yang Chae
- Danyang Zhang
- Runpeng Dai
- Jinghan Zhang
- Imran Nasim
- Nelson Vadori
- Ali Ghiasvand Mohammadkhani
- Sadegh Mahdavi
- Liu Yang
- Dmitry Zubarev
- Chung-En Sun
- Karel Chvalovsky
- Vivan Doshi
- Yingheng Wang
- Sergey Galkin
- Gene Yang
- Shanda Li
- Zeyu Zheng
- Andrew Shin
- Shubham Toshniwal
- Théo Stoskopf
- Yue Hu
- Junyu Guo
- Kazem Meidani
- Han Wang
- Hangrui Bi
- Haoze Wu
- Chengsong Huang
- Prakamya Mishra
- Joshua Mitton
- Gregory Senay
- Marc Lelarge
- Ruize Xu
- Zhongwen Xu
- Manoj Acharya
- Hins Hu
- Qi Cao
- Lechen Zhang
- Alberto Alfarano
- Minsu Kim
- Randy Davila
- Jalal Naghiyev
- Xiang Zheng
- Yash Yardi
- Meghaj Tarte
- Per Kristian Lehre
- Zhanke Zhou
- Maryam Rostamigiv
- Jiaowen Yang
- Jingzhe Shi
- Joohyun Lee
- David Demitri Africa
- Slim Barkallah
- Dong Liu
- Aarush Gupta
- Shuvendu Roy
- Paimon Goulart
- Renjie Cao
- Jingjun Han
- Charles Arnal
- Leroy Z. Wang
- Qianchuan Zhao
- Mark Obozov
- Ningning Xu
- Udaya Ghai
- Yifan Wu
- Vardaan Gangal