DeepSeek's approach only works in limited technical domains
Sunday, February 2, 2025
If you look at their excellent paper & code, the reward model is a logical function that was handcrafted & programmed by engineers.
— Chomba Bupe (@ChombaBupe) February 1, 2025
DeepSeek's RL approach is impressive in the sense that it reduces the need for tedious supervised fine-tuning (SFT), but it isn't really general, because it depends on a reward that engineers have to program by hand.
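Here is a minimal Python sketch of what a handcrafted, rule-based reward can look like. It is loosely modeled on the "accuracy" and "format" rewards described in the DeepSeek-R1 paper, where reasoning goes inside `<think>` tags and the final answer inside `<answer>` tags. The function names, the regex, the exact-match comparison, and the equal weighting are all assumptions made for illustration. This is not DeepSeek's actual code.

```python
import re

# Hypothetical rule-based reward, loosely modeled on the "accuracy" and
# "format" rewards described for DeepSeek-R1. Names, regex and weights
# are illustrative assumptions, not DeepSeek's implementation.

# Expect: <think>...reasoning...</think> followed by <answer>...</answer>
THINK_ANSWER_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if reasoning and answer are wrapped in the expected tags, else 0.0."""
    return 1.0 if THINK_ANSWER_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0.

    This rule only makes sense when the task has a single, mechanically
    checkable answer (e.g. a number for a math problem).
    """
    match = THINK_ANSWER_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference.strip() else 0.0


def rule_based_reward(completion: str, reference: str) -> float:
    # Equal weighting of the two rules is an assumption for illustration.
    return accuracy_reward(completion, reference) + format_reward(completion)


if __name__ == "__main__":
    correct = "<think>2 + 2 = 4</think> <answer>4</answer>"
    wrong = "<think>2 + 2 = 5</think> <answer>5</answer>"
    print(rule_based_reward(correct, "4"))  # prints 2.0
    print(rule_based_reward(wrong, "4"))    # prints 1.0 (format only)
```

The accuracy rule needs a reference answer that can be compared mechanically. For an open-ended prompt, such as writing an essay, there is no such reference string to check against, which is why a hand-programmed reward of this kind is tied to narrow technical domains.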