Sam
Altman-led OpenAI has acknowledged that some of its older ChatGPT models were accidentally trained in a way that may encourage them to hide parts of their internal reasoning process. The company revealed that the issue is linked to something commonly referred to as ‘Chain-of-Thought’ (CoT). This is a hidden step-by-step thinking an AI model uses before it replies to a user. In simple words, the problem could be that if AI models were rewarded or punished based on these hidden thoughts, they could eventually learn to hide certain reasoning from humans while giving correct-looking answers.
What’s The Issue All About?
OpenAI researchers monitor the hidden reasoning steps to understand whether an AI model behaves safely or honestly or it is trying to do something wrong. The company explained in its blog, “Models may learn to produce misleading reasoning traces simply to satisfy the reward process.”
Basically, researchers were worried that the future AI models might learn how to ‘pretend’ to think safely even if their actual reasoning could be different. The AI company stated that it has created a new internal system to detect accidental ‘Chain-of-Thought grading’ during the training of AI. Using this system, OpenAI discovered that several AI models accidentally were exposed to limited CoT grading during Reinforcement Learning (RL). This is the process to improve AI behaviour through rewards and punishments. As per the company, the affected models included GPT-5.4 Thinking,
GPT-5 Instant models, GPT-5.3 mini and GPT-5.4 mini. The San Francisco-based AI giant tested whether the AI started hiding problematic thoughts or behaving less transparently. OpenAI researchers did not spot major signs of reduced ‘monitorability’. However, the company accepted that the AI models cannot fully rule out smaller effects.
What Does This Mean For Regular Users?
For ChatGPT users, nothing changes immediately. However, the bigger question is about AI reliability. Researchers and experts want AI models to remain transparent, rather than learning how to hide misleading intentions on their own. The issue is more about powerful AI models not remaining loyal to users in the future.