As the other commenter said, R1 required very standard RLHF techniques too.
A fun way to think about it, though, is that reasoning models are going to be bigger and will lift the RLHF boat along with them.
But we need a few years to establish basics before I can write a cumulative RL for LLMs book ;)