#1minPapers MSFT’s rStar-Math small language model self-improves and generates own training data
This is the second time in recent months that a small model has performed as well as (or better than) billion-parameter large models. Granted, math problems are unique: they are mostly quantifiable and verifiable.
“Unlike solutions relying on superior LLMs for data synthesis, rStar-Math leverages smaller language models (SLMs) with Monte Carlo Tree Search (MCTS) to establish a self-evolutionary process, iteratively generating higher-quality training data.”
Result: “4 rounds of self-evolution with millions of synthesized solutions for 747k math problems … it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%.”
Process reward modeling (PRM) provides fine-grained feedback on intermediate steps, which matters because incorrect intermediate steps significantly degrade training-data quality in math.
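A toy contrast (not from the paper) between the two kinds of feedback: an outcome reward only checks the final answer, while a process reward scores every intermediate step, so a trace that stumbles onto the right answer through a wrong step can still be flagged. `score_step` is a hypothetical stand-in for any trained per-step scorer.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome-level feedback: one score for the whole trace."""
    return 1.0 if final_answer == gold_answer else 0.0

def process_rewards(steps: List[str], score_step: Callable[[str], float]) -> List[float]:
    """Process-level feedback: one score per intermediate step."""
    return [score_step(step) for step in steps]
```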
The SLM samples candidate nodes, each generating a one-step CoT and the corresponding Python code. Only nodes whose Python code executes successfully are retained, mitigating errors in intermediate steps. MCTS automatically assigns (self-annotates) a Q-value to each intermediate step based on its contribution: steps that contribute to more trajectories leading to the correct answer receive higher Q-values and are considered higher quality.
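A minimal sketch, not the paper's implementation, of these two ideas: keep only candidate steps whose Python snippet executes, and assign each step a Q-value from how often rollouts through it reach the correct answer. `trajectories` and `correct_answer` are assumed inputs standing in for MCTS rollout output and the known ground truth.

```python
from collections import defaultdict

def step_code_executes(python_code: str) -> bool:
    """Retain a candidate step only if its Python code runs without raising."""
    try:
        exec(python_code, {})  # a real pipeline would sandbox this
        return True
    except Exception:
        return False

def self_annotate_q_values(trajectories, correct_answer):
    """trajectories: iterable of (steps, final_answer) pairs from rollouts,
    where each step is a hashable identifier (e.g. its text).
    Q(step) = fraction of rollouts through that step ending at the correct answer."""
    hits, visits = defaultdict(int), defaultdict(int)
    for steps, final_answer in trajectories:
        is_correct = (final_answer == correct_answer)
        for step in steps:
            visits[step] += 1
            hits[step] += int(is_correct)
    return {step: hits[step] / visits[step] for step in visits}
```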
An SLM is also trained as a process preference model (PPM) to predict reward labels for each math reasoning step. Although the Q-values are not precise, they can reliably distinguish positive (correct) steps from negative (irrelevant/incorrect) ones. Using preference pairs and a pairwise ranking loss, instead of directly using Q-values as reward labels, eliminates the inherent noise and imprecision in stepwise reward assignment.
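A minimal sketch of the pairwise ranking idea, assuming PyTorch and a hypothetical `ppm` module that maps encoded (question, partial solution, candidate step) inputs to a scalar score: the loss only asks the model to rank the higher-Q step above its paired lower-Q step, rather than regress onto the noisy Q-values themselves.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(ppm: torch.nn.Module,
                          pos_batch: torch.Tensor,
                          neg_batch: torch.Tensor) -> torch.Tensor:
    """pos_batch / neg_batch: encoded preference pairs where the positive step
    had a higher self-annotated Q-value than its paired negative step."""
    r_pos = ppm(pos_batch).squeeze(-1)  # (batch,) scores for preferred steps
    r_neg = ppm(neg_batch).squeeze(-1)  # (batch,) scores for dispreferred steps
    # Logistic (Bradley-Terry style) ranking loss: push r_pos above r_neg.
    return -F.logsigmoid(r_pos - r_neg).mean()
```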
Paper on arXiv: https://arxiv.org/abs/2501.04519