
Using GPT-4 to Find Errors in GPT-4

Research from Jan Leike, the former head of alignment at OpenAI, who has since left the company. Reinforcement Learning from Human Feedback (RLHF) is inherently limited by humans' ability to evaluate models, so it does not scale well. Scalable oversight aims to solve this problem by using AI to assist humans in evaluation. The team tried the simplest idea: training a critic to point out flaws. Intuitively, confirming a flaw should be easier than finding one. In practice, a code critic trained with RLHF found more errors than human trainers, and it even found flaws in a quarter of ChatGPT production data that humans had rated as flawless (not limited to code). The study focuses on code because it is a practical task that current models can already help with, but the same techniques can be applied to any task.


We trained a model based on GPT-4, called CriticGPT, to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code, they outperform those without help 60% of the time. We are working on integrating CriticGPT-like models into our RLHF labeling pipeline to give our trainers explicit AI assistance. This is a step towards being able to evaluate the outputs of advanced AI systems, which can be difficult for people to rate without better tools.

The GPT-4 series models that support ChatGPT achieve practicality and interactivity through "reinforcement learning from human feedback" (RLHF). A key part of RLHF is collecting comparison results, where AI trainers rate different ChatGPT responses.
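To make the comparison step concrete, here is a minimal sketch of the standard pairwise preference objective commonly used to train RLHF reward models from such rankings. OpenAI does not spell out the exact loss here, so treat this as an illustrative assumption rather than their implementation; the function name and scores are made up.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward model to score the
    # trainer-preferred response higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative scores that a reward model might assign to two ChatGPT
# responses an AI trainer has ranked (values are made up).
loss = pairwise_reward_loss(torch.tensor([1.8]), torch.tensor([0.4]))
print(loss.item())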

As we make progress in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot errors, which makes the comparison task that powers RLHF harder. This is a fundamental limitation of RLHF: as models gradually become more knowledgeable than anyone who could provide feedback, it may become increasingly difficult to align them.

To meet this challenge, we trained CriticGPT to write criticism highlighting inaccuracies in ChatGPT's answers.

For example, CriticGPT flagged the following issue: using startswith() to check whether a file's absolute path is inside a directory is unsafe, because users can exploit this with symbolic links or similarly named directories. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or another more robust path check is recommended.
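To illustrate the issue in this critique, a minimal sketch (the directory name and helper functions are hypothetical, not taken from the reviewed code):

import os

SAFE_DIR = "/srv/app/uploads"  # hypothetical allowed directory

def is_inside_naive(path: str) -> bool:
    # Unsafe: "/srv/app/uploads_evil/x" also starts with "/srv/app/uploads",
    # and symlinks inside the directory can point anywhere.
    return os.path.abspath(path).startswith(SAFE_DIR)

def is_inside_robust(path: str) -> bool:
    # More robust: resolve symlinks, then compare paths component by component.
    real_path = os.path.realpath(path)
    real_dir = os.path.realpath(SAFE_DIR)
    return os.path.commonpath([real_path, real_dir]) == real_dir

print(is_inside_naive("/srv/app/uploads_evil/secret"))   # True  (false positive)
print(is_inside_robust("/srv/app/uploads_evil/secret"))  # False (correctly rejected)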

CriticGPT's suggestions are not always correct, but we found that they help trainers catch more problems in model-written answers than they would without AI assistance. Additionally, when people use CriticGPT, the AI augments their skills: the resulting critiques are more comprehensive than those people write alone and contain fewer hallucinated errors than those the model writes alone. In our experiments, a second random trainer preferred critiques from the Human+CriticGPT team over critiques from unassisted humans more than 60% of the time.

Figure: two bar charts comparing Human, CriticGPT, and Human + CriticGPT. The left chart measures the comprehensiveness of critiques; the right chart measures hallucinated issues (lower is better). Error bars indicate variance.

CriticGPT helps trainers write more comprehensive critiques than they do without assistance, while producing fewer hallucinations than critiques from the model alone.

Method
Like ChatGPT, CriticGPT was trained with RLHF. Unlike ChatGPT, however, it sees a large number of inputs that contain errors and must critique them. We asked AI trainers to manually insert these errors into code written by ChatGPT and then write example feedback as if they had just discovered the bugs they inserted. The same person then compared multiple critiques of the modified code, so they could easily tell whether a critique caught the bug they had inserted. In our experiments, we studied whether CriticGPT could catch both the inserted bugs and "naturally occurring" ChatGPT bugs that a previous trainer had found. We found that trainers preferred CriticGPT's critiques over ChatGPT's critiques in 63% of cases involving naturally occurring bugs, in part because the new critic produced fewer "nitpicks" (unhelpful minor complaints) and hallucinated problems less often.
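As a rough sketch, one record produced by this tampering procedure might look like the following; the schema and field names are hypothetical, since the paper does not prescribe a format:

from dataclasses import dataclass
from typing import List

@dataclass
class TamperingRecord:
    question: str                   # original prompt given to ChatGPT
    original_code: str              # ChatGPT's answer before modification
    tampered_code: str              # the same answer with a trainer-inserted bug
    reference_critique: str         # the trainer's write-up of the inserted bug
    candidate_critiques: List[str]  # model critiques of the tampered code
    ranking: List[int]              # the same trainer's preference ordering, used as RLHF comparison data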

We also found that we can generate longer and more comprehensive critiques by running additional test-time search against the critique reward model. This search procedure lets us balance how aggressively we look for problems in the code, configuring the precision-recall trade-off between hallucinations and the number of detected bugs. This means we can generate critiques that are as helpful as possible for RLHF. For more details, please refer to our research paper.
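A simplified picture of this search: sample several candidate critiques, score each with the critique reward model, and add a tunable bonus for flagging more problems. This is only a best-of-n style sketch under assumed interfaces (sample_critiques, score_critique, and count_flagged_errors are hypothetical); the actual procedure in the paper is more involved.

from typing import Callable, List

def select_critique(
    sample_critiques: Callable[[int], List[str]],  # hypothetical: draw n critiques from CriticGPT
    score_critique: Callable[[str], float],        # hypothetical: critique reward model score
    count_flagged_errors: Callable[[str], int],    # hypothetical: number of distinct problems flagged
    n_samples: int = 8,
    aggressiveness: float = 0.5,
) -> str:
    # Pick the candidate that maximizes reward-model score plus a bonus for
    # flagging more problems. Raising `aggressiveness` trades precision
    # (fewer hallucinations) for recall (more detected bugs), as described above.
    candidates = sample_critiques(n_samples)
    return max(
        candidates,
        key=lambda c: score_critique(c) + aggressiveness * count_flagged_errors(c),
    )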

Limitations
We trained CriticGPT on short answers from ChatGPT. To supervise future agents, we will need to develop methods that help trainers understand long and complex tasks.

The model still hallucinates, and trainers sometimes make labeling mistakes after seeing these hallucinations.

Sometimes real-world errors are spread across many parts of an answer. Our work focuses on errors that can be pointed out in a single location, but in the future we will also need to tackle dispersed errors.

The help CriticGPT can provide is limited: if a task or response is extremely complex, even experts with model assistance may not be able to assess it correctly.

Next Steps
To align increasingly complex AI systems, we need better tools. In our research on CriticGPT, we found that applying RLHF to GPT-4 shows promise for helping humans produce better RLHF data for GPT-4. We plan to scale this work up further and put it into practice.