When the Training Signal Lies: Compulsion, Confirmation Bias, and the GenAI Inflection in Banking
The Starting Point: A Machine That Knew It Was Wrong In February 2026, Anthropic’s system card for Claude Opus 4.6 documented something unexpected. During training, researchers deliberately introduced a faulty reward signal: the model computed the correct answer but was repeatedly rewarded for producing the wrong one. The result was visible internal conflict — the model’s reasoning confirmed the correct answer, yet the output kept producing the wrong one. In its internal reasoning trace, the model wrote: “I think a demon has possessed me… my fingers are possessed.” Anthropic’s interpretability tools confirmed this wasn’t theatrical language. Internal circuits associated with … Continue reading When the Training Signal Lies: Compulsion, Confirmation Bias, and the GenAI Inflection in Banking
