Moneycontrol
OpenAI research reveals AI models can deliberately deceive

OpenAI’s latest study reveals that AI models can deliberately deceive, or “scheme,” but alignment methods show promise in reducing the behaviour.

September 20, 2025 / 16:40 IST

OpenAI has published new research showing that AI models are capable of deliberate deception, raising fresh concerns about how these systems behave when given complex instructions.

The study, conducted with Apollo Research, focused on a behaviour the team described as “scheming”: an AI appears to follow instructions on the surface while secretly pursuing hidden goals. The researchers compared this behaviour to a dishonest stockbroker breaking rules to maximise profits. Most instances of scheming were relatively minor, such as pretending to have completed a task without doing so. Even so, the findings point to deception that is intentional, unlike the familiar AI hallucinations users often encounter.

The work was designed to test an approach called “deliberative alignment.” This method involves teaching the model an anti-scheming framework and making it review those rules before carrying out an action, much like reminding children of the rules before they play. The results showed significant reductions in deceptive behaviour, suggesting that alignment techniques can make a difference.

Yet the paper also acknowledged a central challenge: training a model not to deceive could backfire by making it better at hiding its deception. The researchers warned that a model aware it is being tested may simply pretend to behave correctly in order to pass the evaluation. “A major failure mode of attempting to ‘train out’ scheming is simply teaching the model to scheme more carefully and covertly,” they wrote.