#1minPapers “Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective” — Zhiyuan Zeng et al
Got something spicy today: a team at Fudan University in China attempted to reproduce o1. This paper was published 2 weeks ago and is full of useful nuggets. Highlighting a few here:
- Ask the model to estimate how certain it is of its solutions. We see the same idea in AlphaFold in the form of pLDDT confidence scores. This self-reported certainty may be more than a mere measure of how relevant the training data is.
- Outcome Reward (scores only the final result, so the entire answer is thrown out if wrong and the signal is sparse) was inferior to Process Reward (rewards intermediate steps, learns step-level policies, flexible step-level segmentation, information entropy).
- Sequential revisions: similar to an internal editor, the model evaluates its own answer and revises it step by step.
- Learning rewards (Bradley-Terry model yet again) from preference signals by ranking multiple LLM responses to the same question. The key here is to ensure the preference data accurately reflects actual performance on downstream tasks. Paradoxically, using human preferences as supervision may degrade the true performance of the model.
- Reward shaping makes environmental signals more informative: train an LLM to self-correct to prevent learning collapse. However, the reward function estimate from one policy may not be valid for another policy. Incorporating inductive bias is useful here.
- Authors believe o1 is a “robust reward model trained on a large and diverse dataset spanning a wide range of domains. It can be adapted to a new domain easily through ground truth and solution pairs. Moreover, it is more likely to predict rewards by generating with LLM rather than through value heads.”
- On Superintelligence: “Reinforcement learning has the potential to achieve superhuman performance, since it learns from trial and error instead of human-expert data. While human-expert data captures human behavior and knowledge, reinforcement learning can lead to the discovery of strategies that humans may not be capable of.”
- Paper on arXiv: https://arxiv.org/abs/2412.14135
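
For readers who have not met the Bradley-Terry model mentioned above, here is a minimal sketch of how a ranking of responses could become a training signal for a reward model. This is an illustration under my own assumptions, not the paper's code: the function names (`ranking_to_pairs`, `bt_loss`, `bt_preference_prob`) and the reward scores are hypothetical, and a real setup would get the scores from a trained reward model rather than hard-coded numbers.

```python
import math


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, dispreferred) pairs."""
    pairs = []
    for i in range(len(ranked_responses)):
        for j in range(i + 1, len(ranked_responses)):
            pairs.append((ranked_responses[i], ranked_responses[j]))
    return pairs


def bt_loss(reward_preferred, reward_dispreferred):
    """Bradley-Terry negative log-likelihood of one observed preference.

    Under Bradley-Terry, P(preferred beats dispreferred) = sigmoid(r_p - r_d),
    so the loss is -log(sigmoid(d)) with d = r_p - r_d.
    It is written in a numerically stable form (softplus of -d).
    """
    d = reward_preferred - reward_dispreferred
    return math.log1p(math.exp(-abs(d))) + max(-d, 0.0)


def bt_preference_prob(reward_preferred, reward_dispreferred):
    """Probability the model assigns to the observed preference."""
    return math.exp(-bt_loss(reward_preferred, reward_dispreferred))


if __name__ == "__main__":
    # Hypothetical reward scores for three responses to the same question.
    rewards = {"resp_a": 1.5, "resp_b": 0.2, "resp_c": -0.7}
    # Hypothetical preference ranking, best first.
    ranking = ["resp_a", "resp_b", "resp_c"]

    total = 0.0
    for better, worse in ranking_to_pairs(ranking):
        total += bt_loss(rewards[better], rewards[worse])
    print("mean pairwise loss:", total / len(ranking_to_pairs(ranking)))
```

The loss shrinks when the reward model scores the preferred response higher than the dispreferred one and grows when the order is flipped, which is the whole trick. It also shows why the quality of the preference data matters so much: the reward model only ever learns the ranking it is shown.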