Using AI to Fix AI: OpenAI’s CriticGPT
Amid the current AI boom, one of the main challenges facing the technology’s adoption is that AI models sometimes make mistakes. What’s more, the black-box nature of many AI tools means that catching these mistakes and understanding why they happen can be extremely difficult.
OpenAI recently discussed this problem – and a potential solution – in a blog post based on one of the company’s research papers. There, it announced CriticGPT, a model built on the GPT-4 architecture that identifies and highlights inaccuracies in responses generated by ChatGPT, particularly in coding tasks.
The OpenAI researchers found that when human reviewers use CriticGPT to assess ChatGPT’s code output, they outperform reviewers working without its help 60% of the time. The implications of this work extend far beyond mere error detection and could reshape how we approach AI training, evaluation, and deployment.
Diving into the specifics, CriticGPT was trained using Reinforcement Learning from Human Feedback (RLHF), the same method used to train ChatGPT itself. The approach involved AI trainers manually inserting errors into ChatGPT-generated code and then providing feedback on those inserted mistakes. Using this procedure, OpenAI found that trainers prefer CriticGPT’s critiques over ChatGPT’s in 63% of cases involving naturally occurring bugs, in part because CriticGPT produces fewer small, unhelpful “nitpick” complaints and hallucinates problems less often.
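OpenAI has not released training code, but the data-collection idea can be sketched in Python. The structures and function names below are hypothetical illustrations, not OpenAI’s actual pipeline: a trainer tampers with a ChatGPT code response by inserting a known bug, candidate critiques of the tampered answer are sampled, and the trainer’s ranking of those critiques becomes preference data for RLHF.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data structures illustrating the tampering-based data collection
# described in the paper; names and fields are assumptions, not OpenAI's API.

@dataclass
class TamperedExample:
    question: str          # the original coding task given to ChatGPT
    original_answer: str   # ChatGPT's code response
    tampered_answer: str   # the same response with a trainer-inserted bug
    bug_description: str   # the trainer's reference description of that bug

@dataclass
class CritiquePreference:
    example: TamperedExample
    critiques: List[str]   # candidate critiques sampled from the model
    ranking: List[int]     # indices of critiques, best first

def build_preference_dataset(
    examples: List[TamperedExample],
    sample_critiques: Callable[[str, str], List[str]],
) -> List[CritiquePreference]:
    """Collect ranked critiques of tampered answers for RLHF reward modeling.

    `sample_critiques` stands in for whatever model generates candidate
    critiques. In the real pipeline a human trainer ranks them by whether
    they catch the inserted bug without nitpicking or hallucinating; here
    that judgment is crudely approximated by checking for the known bug
    description.
    """
    dataset: List[CritiquePreference] = []
    for ex in examples:
        candidates = sample_critiques(ex.question, ex.tampered_answer)
        # Critiques that mention the inserted bug sort ahead of those that don't.
        ranking = sorted(
            range(len(candidates)),
            key=lambda i: ex.bug_description.lower() not in candidates[i].lower(),
        )
        dataset.append(CritiquePreference(ex, candidates, ranking))
    return dataset
```

The ranked critiques would then train a reward model, with CriticGPT optimized against that reward, mirroring the RLHF loop used for ChatGPT.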
The study found that agreement among annotators was markedly higher for Critique Bug Inclusion (CBI) questions than for other attributes such as nitpicks or comprehensiveness, suggesting that identifying specific, predefined bugs is a more objective task than assessing other aspects of code quality or critique effectiveness.
The researchers also examined how often annotators agreed when asked to choose between two critiques. Interestingly, human raters frequently disagreed when comparing two critiques on overall quality, particularly for low-rated code responses drawn from the ChatGPT training set.
The paper discussed two types of evaluation data: Human Inserted Bugs and Human Detected Bugs. This dual approach provides a more comprehensive picture of CriticGPT’s performance across different scenarios, covering both artificially introduced errors and naturally occurring mistakes. When the data involved Human Inserted Bugs accompanied by a reference bug description, however, agreement improved considerably.
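To make the distinction concrete, the two evaluation sets can be thought of as records with different provenance, where only inserted bugs carry a trainer-written reference description. The schema below is a hypothetical illustration for clarity, not the paper’s actual data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class BugSource(Enum):
    HUMAN_INSERTED = "human_inserted"  # a trainer deliberately tampered with the code
    HUMAN_DETECTED = "human_detected"  # a bug caught in naturally occurring ChatGPT output

@dataclass
class EvaluationExample:
    question: str
    buggy_answer: str                  # the code response being critiqued
    source: BugSource
    # Only inserted bugs come with a trainer-written reference description,
    # which gives annotators a concrete anchor and raises their agreement.
    reference_bug_description: Optional[str] = None
```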
This pattern of agreement suggests that having clearly identified bugs provides a more concrete context for evaluation, allowing contractors to make more consistent judgments. It also draws attention to the difficulty of obtaining consistent assessments of AI-generated critiques, especially for more subjective aspects of code quality.
Additionally, OpenAI is quick to point out that CriticGPT isn’t doing all the work. The researchers observed that human contractors often kept or modified the AI-generated comments, suggesting a synergistic relationship between human expertise and AI assistance.
There’s clearly more to be done here, but OpenAI’s work with CriticGPT is a big step toward reducing the rate of mistakes generated by models like ChatGPT.