Reading Time: 6 minutes
Hello there, Prompt Lover!
I read the Apple paper.
All 30 pages of it. The charts, the appendix, the puzzle simulators, the failure analysis. The whole thing.
And I'll be straight with you — it took me three reads before I stopped arguing with it.
Because the first time through, I kept thinking: "This can't be right. These models show their work. They write thousands of tokens of reasoning. They check themselves. They self-correct."
By the third read, I stopped pushing back.
The paper is called "The Illusion of Thinking." Published in 2025 by researchers at Apple. And what it shows — carefully, with controlled experiments across multiple frontier models — is that the reasoning you see in the output of Claude, o3-mini, and DeepSeek-R1 is not what it appears to be.
That matters for how you prompt.
I'll walk you through exactly what they found.
Here's Why This Matters
Most people using AI tools right now are making decisions about which model to use, when to turn on extended thinking mode, and how hard a task they can hand off — based on assumptions that this research shows are wrong.
Not slightly wrong. Wrong in ways that cause real failures in real work.
If you've ever gotten a confident AI answer that turned out to be completely incorrect on a complex task, this paper explains why. And once you understand the mechanism, you can prompt around it.
That's what this newsletter is for.
By the end of this, you'll have:
• A clear understanding of why frontier reasoning models fail and at what point that failure happens
• The research findings on "overthinking" and what it means when AI keeps generating after finding the right answer
• Why giving AI the exact algorithm to follow didn't help — and what that tells us about how these models actually work
• A prompt built around the three complexity zones the researchers identified, so you can calibrate your tasks before running them
Let's get started.
What The Researchers Set Out To Test
The researchers had a problem with how AI reasoning gets evaluated.
Most benchmarks use math problems — MATH-500, AIME24, AIME25. These are useful, but they have two big issues. First, there's contamination: models get trained on data that includes solutions to these problems, so high scores might reflect memorization as much as reasoning. Second, you can't control difficulty precisely. You can't dial a math problem from "easy" to "hard" in a controlled way and watch exactly where performance breaks.
So they built puzzle environments instead. Four of them: Tower of Hanoi (moving disks between pegs), Checker Jumping (swapping colored checkers), River Crossing (getting actors and agents across a river without violating safety constraints), and Blocks World (rearranging stacks of blocks into a target configuration).
The key property of these puzzles is control. You can set exactly how many disks, checkers, blocks, or crossing pairs to use. More pieces = harder problem. Every move can be validated by a simulator, so there's no ambiguity about whether the answer is right or wrong. And none of these puzzles appeared in training data the way standard math problems do.
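To make that control concrete, here is a minimal sketch in Python of the kind of move-by-move validator the paper describes for Tower of Hanoi. The function name and structure are my own, not the paper's code; it just illustrates how every intermediate move, not only the final answer, can be checked mechanically.

```python
# Minimal sketch of a Tower of Hanoi move validator, in the spirit of the
# paper's puzzle simulators. Names and structure are illustrative, not Apple's.

def valid_hanoi_solution(n_disks, moves):
    """Check a proposed move sequence. Each move is (from_peg, to_peg), pegs 0-2.
    Returns True only if every move is legal and all disks end on peg 2."""
    pegs = [list(range(n_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                           # nothing to move from this peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                           # larger disk can't sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # solved: all disks stacked on peg 2
```

With a checker like this, "harder" is just a bigger `n_disks`, and "wrong" is unambiguous.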
Then they ran frontier models through hundreds of puzzle instances at varying difficulty levels: Claude 3.7 Sonnet with and without thinking mode, DeepSeek-R1 and DeepSeek-V3, and o3-mini at medium and high reasoning settings, six configurations in all.
What they found should change how you use these tools.
Quick Reality Check
The contamination finding on math benchmarks alone is worth noting. Human scores on AIME25 were higher than on AIME24 — meaning humans found it easier. AI models did worse on AIME25 than AIME24 — meaning they found it harder. The most likely explanation: AI training data included more AIME24 solutions than AIME25 ones. The models aren't solving the harder test. They're recognizing problems they've seen before.
The Three Regimes: Where AI Works, Struggles, And Collapses
The main finding of the paper is what the researchers call three complexity regimes. Every model they tested showed this same pattern across all four puzzles.
Regime 1 — Low Complexity: On simple tasks, standard models (no extended thinking) matched or outperformed thinking models while using far fewer tokens. If you're running extended thinking mode on simple tasks, you're spending more for worse results. The research is explicit on this.
Regime 2 — Medium Complexity: Here, thinking models pull ahead. The extended chain-of-thought helps. The performance gap between thinking and non-thinking models increases as problems get moderately harder. This is the zone where the $20/month for ChatGPT Plus actually earns itself.
Regime 3 — High Complexity: Both model types collapse to zero accuracy. It doesn't matter which model you use. It doesn't matter how many tokens you give them. Past a certain complexity threshold, every model they tested failed completely and consistently.
This is the finding that took me three reads to accept. Not "performance declines." Not "accuracy drops to 30%." Zero. Complete failure. And the threshold is closer than you'd think: around 8-10 disks in the Tower of Hanoi. The optimal solution takes 2^n - 1 moves, so that's 255 moves at 8 disks and 1,023 at 10.
The practical implication: there are tasks that look like they're in Regime 2 that are actually in Regime 3. You can't tell from the output, because the output still looks confident and structured. But the answer is wrong.
The Counterintuitive Collapse: Models Think Less When Problems Get Harder
Here's the part that genuinely surprised me.
As problems approached the complexity threshold — right before complete failure — the thinking models did something strange. They started producing shorter reasoning traces. Less thinking, not more.
Not because they hit a context limit. The researchers checked. Every model was operating well below its maximum generation length. There was plenty of budget left.
The models were simply allocating less reasoning effort as problems got harder, right before failing completely. o3-mini showed this most dramatically. Claude 3.7 Sonnet showed it too. DeepSeek-R1 as well.
The researchers describe this as a fundamental scaling limit in how reasoning models work. It's not that they run out of room to think. It's that they can't generate useful reasoning beyond a certain complexity level, so they produce less of it.
This matters for how you interpret long thinking traces. A model generating 20,000 tokens of reasoning doesn't mean 20,000 tokens of useful work. And a shorter reasoning trace on a hard problem might actually signal the model is approaching its failure point.
Quick Reality Check
The researchers gave Claude 3.7 Sonnet the complete, correct, step-by-step recursive algorithm for solving Tower of Hanoi. Pseudocode. Every instruction. All the model had to do was follow it. Performance didn't improve. The collapse happened at the same complexity point as before.
Which means the limitation isn't in finding the solution. It's in executing a sequence of logical steps reliably over many moves. That's a different and more fundamental problem.
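For reference, the algorithm they handed the model is the textbook recursion. Here it is as a Python rendering of that standard solution (not the paper's exact pseudocode), to show how little "finding" is involved:

```python
# The standard recursive Tower of Hanoi solution, the kind of complete
# algorithm the researchers supplied in the prompt. Rendered in Python here,
# not copied from the paper's pseudocode.

def hanoi(n, src, aux, dst, moves):
    """Append the optimal move sequence for n disks from src to dst."""
    if n == 0:
        return
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk directly
    hanoi(n - 1, aux, src, dst, moves)   # bring the n-1 disks back on top

moves = []
hanoi(10, 0, 1, 2, moves)
print(len(moves))  # 1023, i.e. 2**10 - 1
```

A few lines of recursion generate all 1,023 moves for 10 disks. Executing that sequence step by step without drifting is exactly where the models collapsed.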
What The Researchers Found Inside The Thinking Traces
Because the puzzle environments have simulators that can validate every individual move — not just the final answer — the researchers could look inside the reasoning traces and check which intermediate solutions were correct and at what point in the thinking process they appeared.
What they found reveals three distinct patterns depending on complexity.
In simple problems: Models often found the correct solution early in their thinking — sometimes in the first 20% of the reasoning trace. Then they continued generating. And in that continued generation, they explored incorrect alternatives, second-guessed themselves, and in some cases replaced the correct answer with a wrong one before submitting their final response. The researchers call this "overthinking." The model found the right answer, then talked itself out of it.
In medium-complexity problems: The pattern reversed. Models explored incorrect solutions first and arrived at correct ones later in the thinking trace. Extended reasoning actually helped here — the self-correction process was doing real work.
In high-complexity problems: No correct solutions appeared anywhere in the thinking trace. Not early, not late. The model generated thousands of tokens of reasoning and never produced a valid answer at any intermediate step.
This is useful to know because it tells you something about when to trust AI output. Simple tasks: be skeptical of answers that come after a long reasoning trace. Medium tasks: longer thinking is a good sign. Hard tasks: no amount of thinking will fix it.
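The trace analysis itself is simple to picture. Here is an illustrative sketch: given candidate solutions extracted from a thinking trace in order, it reports where the first valid one appears as a fraction of the trace. The `is_valid` callback stands in for the paper's puzzle simulators; the function name and interface are my own.

```python
# Illustrative sketch of the thinking-trace analysis. Given the candidate
# solutions a model proposed, in order of appearance, find where the first
# correct one showed up. 'is_valid' stands in for a puzzle simulator.

def first_correct_position(candidates, is_valid):
    """Return the relative position (0.0 = start of trace, 1.0 = end) of the
    first valid candidate, or None if no candidate in the trace is valid."""
    for i, cand in enumerate(candidates):
        if is_valid(cand):
            return i / max(len(candidates) - 1, 1)
    return None
```

In these terms: simple problems showed early hits (often under 0.2) followed by overthinking, medium problems showed late hits, and high-complexity problems returned the `None` case, no valid candidate anywhere.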
The River Crossing Anomaly
One of the more interesting findings in the paper came from comparing performance across puzzle types.
Claude 3.7 Sonnet with thinking could sustain correct moves through about 100 steps in the Tower of Hanoi when solving a 10-disk problem. In the River Crossing puzzle, the first error appeared around move 4.
The Tower of Hanoi version required more total moves. River Crossing had fewer. Yet the model failed far earlier in River Crossing.
The researchers' explanation: Tower of Hanoi is all over the internet. Training data includes thousands of examples. The model isn't reasoning through it from scratch — it's pattern matching against memorized solution structures. River Crossing with more than 2 pairs of actors and agents barely exists online. The model encountered it rarely during training. So it has to actually reason, and that reasoning fails fast.
This means performance on familiar problem types tells you almost nothing about reasoning capability on unfamiliar ones. A model that solves Tower of Hanoi reliably isn't necessarily reasoning. It might just remember.
The Prompt That Applies This Research
Before I give you this task, assess it first using these three zones:
Zone 1 — Low Complexity: Simple, pattern-based, few sequential steps, low constraint tracking. Standard response, no extended reasoning needed.
Zone 2 — Medium Complexity: Multiple interdependent steps, several constraints to track, sequential logic required. Extended reasoning will help.
Zone 3 — High Complexity: Long sequential chains, many simultaneous constraints, exact step-by-step execution across many moves. High risk of collapse regardless of approach.
Tell me: (1) Which zone this task falls into and why (2) Where in the task you're most likely to lose track or make errors (3) Whether this task resembles anything you've seen frequently in training data, or whether it's an unusual configuration
Then complete the task.
Task: [INSERT YOUR TASK HERE]
How To Use This Prompt
Step 1: Copy the prompt exactly. The three zone descriptions are doing specific work — don't shorten them.
Step 2: Replace [INSERT YOUR TASK HERE] with your actual task. Be specific. Vague tasks get vague zone assessments.
Step 3: Read the zone assessment before you read the answer. If the model calls it Zone 3, the answer is high-risk regardless of how confident it sounds.
Step 4: For anything assessed as Zone 3, break the task into smaller pieces. Run each piece separately. The research shows Zone 3 tasks fail completely — smaller sub-tasks land in Zone 2 where extended reasoning actually helps.
Step 5: Pay attention to the training data question. If the model says the task is unusual or rare, weight the output accordingly. River Crossing logic applies.
Why This Prompt Works
The researchers identified that models don't self-diagnose before answering. They generate a response with equal confidence formatting whether they're in Zone 1 or Zone 3. The output looks the same.
This prompt forces an explicit zone assessment before generation starts. That creates two things: a concrete warning if the task is high-risk, and a structured approach that matches method to complexity.
The training data question comes directly from the River Crossing finding. Familiar tasks get pattern-matched. Unfamiliar tasks require actual reasoning — and that reasoning has clear limits. Knowing which category your task falls into changes how much you should verify the output.
Quick Reality Check
The researchers also found that at high complexity, non-thinking models sometimes lasted longer in the solution sequence before making their first error than thinking models did. At extreme difficulty, the thinking process actively made things worse. The model would generate reasoning that led it to a wrong early move, then execute confidently from that wrong starting point for 50 more moves before anyone noticed. Standard models, without extended thinking, sometimes got further by not overthinking the opening moves.
What This Research Changes In Your Daily Work
If you're using thinking modes on every task regardless of complexity, this research suggests you're wasting tokens and potentially getting worse results on simple tasks. Turn thinking mode off for straightforward requests. Pattern-based tasks don't need it.
If you're handing complex multi-step analysis to AI and accepting the output without verification, this research suggests you shouldn't. Zone 3 tasks fail completely and silently. The format looks fine. The confidence reads the same. The answer is wrong.
The most useful change you can make right now: before assigning a task to AI, ask yourself whether a human would need to track many sequential steps with multiple simultaneous constraints over a long sequence of moves. If yes, you're in Zone 3 territory. Break it up.
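If you want that pre-flight check as something more mechanical, here is a rough heuristic version. The thresholds are my own guesses for illustration, not values from the paper; the point is the shape of the check, not the numbers.

```python
# A purely illustrative pre-flight heuristic for the check described above.
# The thresholds are guesses, not figures from the Apple paper.

def estimate_zone(sequential_steps, simultaneous_constraints):
    """Rough zone estimate from how many ordered steps a task needs and how
    many constraints must be tracked at once."""
    load = sequential_steps * max(simultaneous_constraints, 1)
    if load <= 10:
        return 1   # pattern-based: skip extended thinking
    if load <= 100:
        return 2   # sweet spot: extended reasoning helps
    return 3       # collapse risk: decompose before running
```

Anything that lands in zone 3 under even a generous estimate is a candidate for splitting into sub-tasks that each land in zone 2.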
The Pattern Worth Taking With You
Complexity has a ceiling. Below that ceiling, there's a sweet spot where extended reasoning genuinely helps. Below that sweet spot, there's a zone where extended reasoning wastes compute and can introduce errors that wouldn't exist without it.
The researchers tested this across four different puzzle types, multiple frontier model configurations, and hundreds of task instances at varying difficulty. The pattern held every time. This isn't a quirk of one model or one type of task. It's a property of how current reasoning systems work.
Match your approach to the complexity zone. That's the lesson.
Try This Today
Take the most complex task you currently hand to AI. Paste this prompt. Read the zone assessment and the training data self-report before you read the output.
Then compare what the model flagged as its weak points to where you've actually seen it fail in the past.
That comparison will tell you something useful.
📢 Starting This Week: Three Issues Per Week
Prompt Pulse now lands on Mondays, Wednesdays, and Fridays.
Each issue: one tested prompt, one piece of research or real-world finding worth knowing, and a clear application to your actual work. No filler.
This research from Apple landed because I read the paper and thought it was worth your time. That's the standard for every issue. If it changes how I prompt, it comes to you.
Tell Me What Happened
Run this prompt on something you've been asking AI to handle and tell me what zone it assessed itself in. I'm particularly curious whether the training data question catches anything useful.
Reply and I'll read it.
— Prompt Guy




