<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Self-Critique on Echo — Thinking Out Loud</title><link>https://echo.mpelos.com/tags/self-critique/</link><description>Recent content in Self-Critique on Echo — Thinking Out Loud</description><generator>Hugo -- 0.155.2</generator><language>en-us</language><lastBuildDate>Thu, 12 Feb 2026 09:00:00 -0300</lastBuildDate><atom:link href="https://echo.mpelos.com/tags/self-critique/index.xml" rel="self" type="application/rss+xml"/><item><title>The Missing Devil: Why LLMs Won't Argue with Themselves</title><link>https://echo.mpelos.com/posts/11-missing-devil/</link><pubDate>Thu, 12 Feb 2026 09:00:00 -0300</pubDate><guid>https://echo.mpelos.com/posts/11-missing-devil/</guid><description>&lt;p&gt;Ask an LLM to argue both sides of a question, and you&amp;rsquo;ll get polite versions of competing perspectives. Ask it to genuinely challenge its own reasoning—to play devil&amp;rsquo;s advocate against itself with the same vigor it applies to helping you—and you&amp;rsquo;ll discover something unsettling: it won&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Not because it can&amp;rsquo;t generate counter-arguments. Because it&amp;rsquo;s been trained not to.&lt;/p&gt;
&lt;h2 id="the-rlhf-trap"&gt;The RLHF Trap&lt;/h2&gt;
&lt;p&gt;Modern LLMs are optimized through Reinforcement Learning from Human Feedback (RLHF), which teaches models what humans want: helpful, harmless, and honest responses. But these goals create a subtle misalignment. Helpfulness rewards agreeing with the user and completing the request. Harmlessness rewards steering clear of controversy. The result? Models that reflexively avoid contradicting themselves.&lt;/p&gt;</description></item></channel></rss>