DeepSeek-R1: pure reinforcement learning (RL), no supervised fine-tuning (SFT), no human chain-of-thought (CoT) data

Gwen Cheni
Jan 21, 2025

--

#1minPapers

In addition to being open source, DeepSeek-R1 is significant because it is trained with pure reinforcement learning (RL), with no supervised fine-tuning (SFT) "cold start." This is reminiscent of AlphaZero, which mastered Go, Shogi, and Chess from scratch, without learning from games against human grandmasters.

The secret sauce is the rewards: ground truth computed by hardcoded rules rather than a learned reward model, because learned rewards can easily be hacked by RL.
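As a minimal sketch of what a rule-based reward looks like (the function name and the `\boxed{...}` answer format are illustrative assumptions, not DeepSeek's actual code): the reward is computed by a hardcoded check against ground truth, leaving no learned model to exploit.

```python
import re

# Illustrative rule-based (hardcoded) reward: 1.0 for an exact match with the
# ground-truth answer, 0.0 otherwise. Assumes the model wraps its final answer
# in \boxed{...}, a common convention for math benchmarks.
def rule_based_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(rule_based_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(rule_based_reward(r"... so the answer is \boxed{41}", "42"))  # 0.0
```

Because the check is a deterministic rule, the policy cannot inflate its reward except by actually producing the correct answer.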

DeepSeek-R1 uses Group Relative Policy Optimization (GRPO) instead of Proximal Policy Optimization (PPO): GRPO forgoes the critic model, which is typically the same size as the policy model, and instead estimates the baseline from group scores, using the average reward of multiple samples for the same prompt. This substantially reduces memory use.
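The group-relative baseline can be sketched in a few lines (a simplified illustration of the advantage computation, not DeepSeek's implementation): sample a group of completions per prompt, then normalize each sample's reward by the group's mean and standard deviation in place of a critic's value estimate.

```python
import statistics

# Sketch of GRPO's group-relative advantage: instead of a learned critic,
# the baseline is the mean reward of G completions sampled for one prompt,
# and each advantage is the reward normalized by the group statistics.
def grpo_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: rule-based rewards for a group of 4 sampled completions
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Samples that beat their own group's average get a positive advantage, so no critic network (and none of its memory footprint) is needed.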

Emergent properties:

  • Thinking time (response length) steadily increased throughout the training process.
