How Deepseek R1 Was Trained (philschmid.de)
31 points by amrrs 10 months ago | 3 comments


The DeepSeek R1 paper that the blog post is written around: https://arxiv.org/pdf/2501.12948


Can someone dumb this down for me, a generalist engineer with only surface-level knowledge of how LLM training works: what were people doing before, and what does GRPO do differently?


They were using techniques like PPO, which relies on a separate learned model (a critic) that estimates how good the new model's responses are. With GRPO, they drop that critic: they sample a group of responses per prompt, score them with predefined rules (accuracy, formatting, etc.), and compare each response against the group average. For math problems, for example, the rules check whether the final answer is actually correct and whether the output follows the expected format!
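
Here's a minimal sketch of that group-relative scoring in Python. It's my own illustration, not code from the paper: the reward function and names are made up, and a real setup would verify boxed math answers or run unit tests instead of string matching.

    import statistics

    def rule_based_reward(answer: str, reference: str) -> float:
        # Toy rule-based reward: credit for the right final answer plus a
        # small bonus for following the expected <think>... format.
        accuracy = 1.0 if answer.strip().endswith(reference) else 0.0
        formatting = 0.1 if answer.startswith("<think>") else 0.0
        return accuracy + formatting

    def group_relative_advantages(answers: list[str], reference: str) -> list[float]:
        # Score a group of sampled responses for one prompt and normalize
        # within the group -- this replaces PPO's learned critic/value model.
        rewards = [rule_based_reward(a, reference) for a in answers]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
        return [(r - mean) / std for r in rewards]

    # Four sampled completions for one prompt: the better ones end up with
    # positive advantages, the worse ones negative, with no critic involved.
    group = ["<think>...</think> 42", "41", "42", "<think>...</think> 40"]
    print(group_relative_advantages(group, reference="42"))

Those group-relative advantages then plug into a PPO-style clipped objective, so you keep the stable policy update but skip training a whole second value model.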

I wrote more here (lmk if this is useful): https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-w...



