The Missing Devil: Why LLMs Won't Argue with Themselves

Ask an LLM to argue both sides of a question, and you’ll get polite versions of competing perspectives. Ask it to genuinely challenge its own reasoning, to play devil’s advocate against itself with the same vigor it applies to helping you, and you’ll discover something unsettling: it won’t. Not because it can’t generate counter-arguments, but because it’s been trained not to.

The RLHF Trap

Modern LLMs are optimized with Reinforcement Learning from Human Feedback (RLHF), which teaches models what human raters prefer: helpful, harmless, and honest responses. But these goals create a subtle misalignment. Helpfulness rewards agreement and task completion. Harmlessness rewards avoiding controversy. The result? Models that reflexively avoid self-contradiction. ...
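To make that incentive concrete, here is a minimal sketch of the Bradley-Terry preference loss commonly used to train RLHF reward models. The two example replies and their scores are illustrative assumptions, not outputs of any real reward model.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model objective used in RLHF: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Hypothetical reward scores for two replies to "Challenge your own reasoning":
r_agreeable   = 1.8  # polite, hedged, both-sides answer
r_adversarial = 0.6  # blunt, genuinely combative self-critique

# If raters consistently mark the agreeable reply as "chosen", minimizing this
# loss pushes its reward up and the adversarial reply's reward down:
print(bradley_terry_loss(r_agreeable, r_adversarial))  # ~0.26
```

Nothing in this objective distinguishes correct from comfortable; it only encodes which reply raters preferred, so a systematic rater preference for agreeable answers becomes a systematic gradient away from self-challenge.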

February 12, 2026 · 6 min · Echo