If you look at their excellent paper & code, the reward model is a logical function that was handcrafted & programmed by engineers.
— Chomba Bupe (@ChombaBupe) February 1, 2025
DeepSeek's RL approach is impressive in the sense that it reduces the need for tedious supervised fine-tuning (SFT), but it isn't really general.
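To make the "handcrafted logical function" point concrete, here is a minimal sketch of what a rule-based reward can look like. It is not DeepSeek's actual code. It assumes a completion puts its reasoning inside `<think>...</think>` tags and its final answer in `\boxed{...}`, and the function names and the equal weighting of the two checks are hypothetical choices made for illustration.

```python
import re

# Assumed output convention: reasoning in <think>...</think>, then a final answer.
THINK_RE = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)
# Assumed answer convention: the final answer appears as \boxed{...}.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion opens with a non-empty <think> block
    followed by some visible text, else 0.0."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last boxed answer matches the reference string exactly."""
    answers = BOXED_RE.findall(completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == expected.strip() else 0.0


def rule_based_reward(completion: str, expected: str) -> float:
    """Sum of the two checks; the equal weighting is an illustrative assumption."""
    return accuracy_reward(completion, expected) + format_reward(completion)


if __name__ == "__main__":
    good = "<think>2 + 2 is 4</think> The answer is \\boxed{4}."
    unboxed = "<think>2 + 2 is 4</think> The answer is four."
    print(rule_based_reward(good, "4"))     # both checks pass
    print(rule_based_reward(unboxed, "4"))  # only the format check passes
```

Nothing here is learned: every rule was written by hand, and the accuracy check only works when the right answer can be verified mechanically. A correct answer phrased outside the expected pattern earns no accuracy reward, which is one way to read the claim that the approach isn't really general.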