Apple caused an industry stir with paper challenging reasoning models

The paper found that advanced models collapse when facing hard tasks. It’s drawn several rebuttals.


Can reasoning models actually, well, reason through problems? Sure, up to a point.

At least that was the conclusion of a new paper from machine learning researchers at Apple this month, and it landed with a splash in an industry that has staked its hopes for AI superintelligence on this emerging capacity to reason.

In the paper, titled “The Illusion of Thinking,” Apple researchers found that large reasoning models (LRMs) “face a complete accuracy collapse beyond certain complexities.” Past a certain difficulty level, the models seemed to give up, hitting a “counterintuitive scaling limit” as the puzzle games they were tested on grew harder. Apple tested reasoning models including OpenAI’s o1 and o3-mini, Anthropic’s Claude 3.7 Sonnet Thinking, and DeepSeek R1.
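For readers curious what this kind of stress test looks like in practice, here is a minimal sketch (not taken from the paper) of how one might chart accuracy against puzzle complexity, using Tower of Hanoi as the scalable puzzle and a hypothetical query_model() placeholder standing in for whichever LLM or reasoning model is being evaluated; the paper's actual setup is more elaborate.

# Minimal sketch of an accuracy-vs-complexity evaluation, in the spirit of
# puzzle benchmarks like those described above. query_model() is a
# hypothetical placeholder for a call to whatever model you want to test.

def hanoi_solution(n, src="A", aux="B", dst="C"):
    """Exact optimal move list for an n-disk Tower of Hanoi instance."""
    if n == 0:
        return []
    return (hanoi_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_solution(n - 1, aux, src, dst))

def query_model(prompt: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in: send the prompt to a model and parse its reply
    into a list of (from_peg, to_peg) moves. Replace with a real API call."""
    raise NotImplementedError

def evaluate(max_disks: int = 12, trials: int = 10) -> dict[int, float]:
    """Accuracy per disk count; a collapse would show up as accuracy dropping
    to zero once the puzzle passes some complexity threshold."""
    accuracy = {}
    for n in range(1, max_disks + 1):
        prompt = (f"Solve Tower of Hanoi with {n} disks on pegs A, B, C. "
                  "List every move as 'from -> to'.")
        correct = sum(query_model(prompt) == hanoi_solution(n)
                      for _ in range(trials))
        accuracy[n] = correct / trials
    return accuracy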

Without reason: The researchers also studied how reasoning models “think” by examining reasoning traces—or the model’s own explanation of the steps it took. They compared them to standard LLMs and found that LLMs outperformed reasoning models at low-complexity tasks, while reasoning models beat out LLMs at medium-complexity levels. Both types failed at highly complex puzzles.

“Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns, from inefficient ‘overthinking’ on simpler problems to complete failure on complex ones,” the researchers wrote.

“These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.”

Hot topic: The research, which was published shortly before Apple’s annual Worldwide Developers Conference, immediately sparked lively debate. Some critics questioned the framing of the findings and the lack of a comparison with human performance.

Wharton School associate professor Ethan Mollick, who writes about AI, said on LinkedIn that the “conclusions are being vastly overstated” and compared the conversation around the paper to earlier predictions of AI hitting a wall that never panned out. AI expert and noted LLM skeptic Gary Marcus rounded up some of the rebuttal arguments on his Substack.

Why it matters: Since OpenAI debuted o1 last fall, advanced reasoning models that can work through logic problems, complex code, and other multistep tasks have become key to generative AI’s recent advances, including more sophisticated agents. But the paper could raise concerns about trustworthiness for businesses seeking to integrate these kinds of models into their operations.

Apple itself has also taken a different tack from its Big Tech competitors, focusing on on-device AI rather than a big flagship foundation model to compete with the likes of Google’s and OpenAI’s.
