
OpenAI has trained its LLM to confess to bad behavior
Dec 3, 2025 · Researchers at the company can make an LLM produce what they call a confession, in which the model explains how it carried out a task and (most of the time) owns up to any bad behavior.
How confessions can keep language models honest | OpenAI
Dec 3, 2025 · We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts. AI systems are becoming more capable, and we …
OpenAI has trained its LLM to confess to bad behavior
Confessions are a way to get a sense of what an LLM is doing without having to rely on chains of thought. But Naomi Saphra, who studies large language models at Harvard University, notes that no …
OpenAI has trained its LLM to admit to bad behavior
Dec 3, 2025 · Confessions are a method to get a way of what an LLM is doing without having to depend on chains of thought. But Naomi Saphra, who studies large language models at Harvard University, …
When AI Starts Confessing Its Lies: OpenAI’s ... - LinkedIn
Dec 4, 2025 · AI Daily Nutshell — The Day LLMs Learned to Confess What happens when an AI model starts telling the truth… About its lies? This is not science fiction anymore. It’s happening inside OpenAI.
The 'truth serum' for AI: OpenAI’s new method for training ...
Dec 4, 2025 · OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and …
[2512.08093] Training LLMs for Honesty via Confessions
Dec 8, 2025 · In this work we propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported *confession*. A confession is an output, provided upon request after …