Training a large language model (LLM) involves three key stages: pre-training, supervised fine-tuning (SFT), and finally Reinforcement Learning from Human Feedback (RLHF). The well-known AI researcher Andrej Karpathy has criticized this last stage: in his view, although RLHF is nominally a form of reinforcement learning, there is a significant gap between it and truly powerful reinforcement learning.
Compared with traditional reinforcement learning, RLHF lacks a crucial ingredient: its reward signal comes from a reward model that merely imitates human preferences rather than measuring whether the problem is actually solved. As a result, the optimization it drives can lack a clear direction and depth.
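To make that distinction concrete, below is a minimal PyTorch sketch of the pairwise preference loss commonly used to train RLHF reward models, assuming a hypothetical reward_model module that maps a prompt-response pair to a scalar score. The loss only asks the model to reproduce the annotators' ranking; nothing in it checks whether either response actually solves the task.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss commonly used to fit a reward
    model to human preference comparisons. `reward_model` is a hypothetical
    module returning a scalar score for a (prompt, response) pair."""
    r_chosen = reward_model(prompt, chosen)      # score for the annotator-preferred response
    r_rejected = reward_model(prompt, rejected)  # score for the rejected response
    # The model is only trained to rank the preferred response above the rejected one;
    # nothing here verifies whether either response is actually correct.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Because the training signal is purely a ranking fit, the resulting reward model can only be as well-grounded as the preferences it imitates.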
In addition, RLHF has two fundamental problems. First, the reward model can mislead the optimization process: because it is fit to human preferences, it may not capture what actually constitutes a correct or optimal solution, and can steer the model in the wrong direction. Second, the policy is prone to gaming the reward model by finding adversarial examples, that is, outputs the reward model scores highly for the wrong reasons, so the model earns reward without genuinely improving its performance or problem-solving ability.
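As a rough illustration of how this second failure mode is usually contained in practice, the sketch below shows the shaped reward used in PPO-style RLHF pipelines: the reward model's score minus a KL penalty toward the frozen SFT reference model. The function name and the beta coefficient are illustrative assumptions, not any specific library's API; the point is that the KL term exists precisely because an unconstrained policy tends to drift toward adversarial outputs that the reward model overrates.

```python
def shaped_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """Reward typically fed to the RL step in PPO-style RLHF (illustrative sketch).

    rm_score       -- scalar score from the learned reward model
    logprob_policy -- log-probability of the sampled response under the current policy
    logprob_ref    -- log-probability of the same response under the frozen SFT model
    beta           -- KL penalty strength (0.1 is an arbitrary placeholder)
    """
    # Sample-based estimate of KL(policy || reference) for this response.
    kl_penalty = logprob_policy - logprob_ref
    # Penalizing divergence from the reference model is the standard guard against
    # the policy exploiting blind spots (adversarial examples) in the reward model.
    return rm_score - beta * kl_penalty
```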
Despite these limitations, RLHF undeniably brings real benefits when building LLM assistants, because it takes full advantage of human annotators' strength at selecting the best answer. Drawing on their knowledge, experience, and judgment, annotators can give the model more accurate and valuable feedback, helping it better understand the task requirements and the user's needs.
RLHF also helps reduce hallucination in large language models. Supervising the model with the reward model can teach it to avoid asserting incorrect factual claims: when the model outputs wrong information, the reward model penalizes it, pushing the model to keep adjusting its outputs toward greater accuracy and reliability.
For now, however, effective and practical reinforcement learning has not been achieved on open-domain problem-solving tasks. RLHF and existing reinforcement learning methods still face significant challenges on complex, open-ended problems, and further research and innovation are needed before reinforcement learning becomes genuinely effective and practical in that setting.