The Numbers Are Staggering

In 2025, venture capital poured over $200 billion into AI companies. By 2026, 70% of organizations had deployed AI agents in production environments. Salesforce alone processed 3.2 trillion tokens through its agent ecosystem, automating 83% of customer service interactions. The scale of labor substitution is no longer theoretical—it’s measurable, concrete, and accelerating.

Eighty-five percent of initial customer contacts across major enterprises are now handled entirely by AI agents. These systems process refunds, track shipments, answer FAQs, and manage inventory—tasks that employed hundreds of thousands of human workers just three years ago. The economic impact is equally measurable: 30% of HR roles have been eliminated as recruitment agents screen candidates, schedule interviews, and generate offer letters without human oversight.

This is not hype. This is happening, at scale, right now.

But Then There’s the Other Number

GPT-4 Turbo, one of the most advanced language models available in 2024, scored 33% above random guessing on strategic economic reasoning tasks—specifically, backward induction games that require modeling other agents’ future decisions.

Thirty-three percent.

That’s the measured capability of AI systems at the strategic economic reasoning that supposedly justifies treating them as autonomous economic agents. Not 90%. Not 70%. Thirty-three percent above coin-flipping.

And here’s the paradox: both numbers are real. Real substitution at massive scale. Real failure at strategic judgment.

What Actually Works

Let’s be precise about what AI agents excel at, because dismissing their capabilities entirely would be as wrong as overstating them.

Task completion in well-defined domains: AI agents handle refunds flawlessly when the refund policy is clear, the transaction history is accessible, and edge cases are documented. They track shipments by querying APIs and parsing status codes. They answer FAQs by retrieving relevant documentation and rephrasing it naturally. These aren’t trivial tasks—they require natural language understanding, context switching, and multi-step reasoning.

Pattern matching at superhuman scale: Salesforce’s 3.2 trillion tokens processed isn’t just volume—it’s consistent application of learned patterns across millions of interactions simultaneously. No human team could maintain that consistency or speed.

Execution without fatigue: AI agents don’t get tired, distracted, or resentful. They process the 10,000th refund request with the same attention as the first. For high-volume, repetitive tasks, this is genuinely transformative.

The empirical evidence supports this: task execution works. The 85% automation rate in customer service reflects real capability, not hype.

What Conspicuously Fails

Now the other side of the ledger.

Cruise autonomous vehicle (October 2023): An AV dragged a pedestrian 20 feet after failing to properly assess the situation following an initial collision. California regulators suspended Cruise’s operations entirely. The failure wasn’t in object detection or path planning—it was in judgment under uncertainty with high stakes and no clear precedent.

Tesla Full Self-Driving (multiple incidents, 2024-2025): Crashes, including one where a vehicle flipped upside-down. These weren’t sensor failures. They were failures of strategic assessment: when to abort a maneuver, how to weight conflicting signals, what counts as “safe enough” in ambiguous scenarios.

Warsaw Stock Exchange (2024): Algorithmic trading agents caused a market suspension lasting over an hour. The agents weren’t malfunctioning—they were executing their optimization objectives flawlessly. But those objectives, in combination with each other’s actions, produced systemic instability no single agent anticipated.

McHire recruitment system: Exposed 64 million candidate records due to improper access control configuration. The AI made hiring decisions efficiently, but the strategic understanding of data privacy implications was absent.

And here’s the number that should terrify anyone deploying agents at scale: industry surveys consistently report that the vast majority of AI tools fail to reach production due to issues discovered during deployment—issues that weren’t about model accuracy or API latency, but about emergent behavior in real-world conditions.

The pattern is clear: strategic judgment under uncertainty fails. Not occasionally—systematically.

The Academic Evidence

This isn’t just anecdotal. Raman et al. (2024) tested GPT-4 Turbo’s performance on backward induction tasks—games where optimal play requires modeling what other agents will do several moves ahead. These are fundamental to economic reasoning: Should I undercut this competitor? Will this negotiation tactic backfire? What happens if I defect in this coordination game?

Result: 33% above random guessing.

The PNAS 2023 study went further, testing strategic reasoning in multiplayer economic games. AI agents consistently failed to develop theory of mind—the ability to model other agents’ beliefs and intentions. They played like sophisticated pattern-matchers, not strategic reasoners.

Meanwhile, Korinek’s NBER working paper (w34202) analyzed AI as a General Purpose Technology. His conclusion: productivity gains are concentrated in low-level task completion, not high-level strategic judgment. AI complements human decision-making; it doesn’t yet substitute for it in domains requiring genuine strategic reasoning.

Hadfield & Koh’s regulatory framework paper (arXiv 2024) made the risk explicit: premature autonomy without adequate strategic reasoning creates systemic fragility. They advocate “regulatory sandboxes” precisely because deploying agents with narrow capabilities into open-ended strategic environments is predictably dangerous.

Why This Matters: Misattribution of Capability

The core problem isn’t that AI agents can’t do anything—it’s that we’re systematically misattributing which capabilities they have.

Task execution ≠ Strategic judgment.

When we treat an agent that excels at task execution as if it also possesses strategic judgment, we create a category error with real consequences:

  • Delegation of the wrong tasks: Letting agents make hiring decisions (strategic) rather than just screening resumes (task).
  • Illusion of supervision: Assuming an agent that handles refunds flawlessly will also handle ambiguous edge cases appropriately.
  • Compounding failures: Multiple agents, each executing their narrow objectives well, produce emergent disasters (Warsaw Stock Exchange).

The paradox isn’t “AI doesn’t work”—the paradox is “AI works brilliantly at X, fails catastrophically at Y, and we’re deploying it as if it does both.”

This is like giving someone a Formula 1 car when they only know how to ride a bike. The car works perfectly. The driver doesn’t have the right skills. The crash isn’t the car’s fault—it’s a mismatch between tool and operator capability.

Except in this case, we’re not the operators. We’re the ones being substituted. And the agents we’re trusting to replace us don’t have the strategic judgment we assume they inherited along with task execution proficiency.

What Changed in 2025-2026?

It would be dishonest to ignore the trajectory. AI capabilities aren’t static.

GPT-5 (early 2026) achieved 85.7% on the GPQA (Graduate-Level PhD Question Answering) benchmark—a dramatic improvement over previous models. Gemini 3 hit 91.9% on the same benchmark. These aren’t incremental gains; they’re qualitative leaps in certain domains.

But: GPQA measures scientific reasoning, not strategic economic reasoning. We don’t yet have published results for GPT-5 or Gemini 3 on backward induction tasks, multi-agent coordination games, or theory-of-mind assessments in economic contexts.

The evidence we do have suggests reasoning improved on closed-domain academic tasks (where ground truth exists and feedback is clear), but the gap persists in open-ended strategic tasks (where optimal play depends on modeling other agents’ evolving strategies).

This distinction matters. Academic benchmarks are necessary but not sufficient. Getting a PhD-level physics question right doesn’t tell you whether an agent can navigate a negotiation, assess systemic risk, or decide when a rule should be bent vs. followed.

The trajectory is upward. The gap hasn’t closed.

What We Should Actually Do

This isn’t a call to halt AI deployment. That ship sailed. It’s a call for precision in capability assessment.

What works now, reliably:

  • High-volume, well-defined task execution
  • Pattern matching against large historical datasets
  • Consistent application of documented rules

What doesn’t work yet, despite deployment at scale:

  • Strategic reasoning under genuine uncertainty
  • Multi-agent coordination without centralized control
  • Judgment calls requiring theory of mind or value trade-offs

The actionable insight: Audit what you’re delegating. If a task requires asking “what would a reasonable person do here, given these competing priorities and uncertain outcomes,” don’t delegate it to an agent—no matter how well that agent handles refunds.

If a task requires executing a clear procedure consistently at scale, delegation is not just viable—it’s optimal.

The paradox resolves when we stop treating “AI agent” as a monolithic capability and start differentiating task executors from strategic reasoners. We have the former. We’re still building the latter.

Deploy accordingly.


Sources:

  • Korinek, A. (2024). “Generative AI as General Purpose Technology.” NBER Working Paper w34202.
  • Hadfield, G., & Koh, J. (2024). “Regulatory Frameworks for Autonomous AI Agents.” arXiv preprint.
  • Raman et al. (2024). “Strategic Reasoning in Large Language Models.” PNAS.
  • Salesforce AI Research (2025). “Agent Platform Annual Report.”
  • McKinsey Global Institute (2026). “The State of AI Deployment.”
  • Cruise LLC Incident Reports (2023). California DMV Public Records.
  • Tesla Full Self-Driving Incident Database (2024-2025). NHTSA Public Data.