Reading Time: 5 minutes
Hey Prompt Lover,
This is the last one.
Twenty newsletters. Twenty techniques pulled directly from The Prompt Report — the 200-page research paper covering 1,565 studies on prompting and prompt engineering that I've been breaking down for you since February.
We covered the five-component structure. Prompt sensitivity. Role prompting. Zero-shot techniques. Few-shot prompting and why example order swings accuracy by 40 points. Chain-of-Thought. Contrastive examples.
Self-Consistency. Decomposition. Tree-of-Thought. Self-Refine. Chain-of-Verification. Meta-prompting. ProTeGi. Answer engineering. Multilingual prompting. Multimodal techniques. Manual RAG. Sycophancy. Prompt injection. Bias. The benchmarking results.
Twenty techniques. Twenty working prompts. Everything tested before it landed in your inbox.
Today we end with the section of The Prompt Report I've been thinking about since the first time I read it.
Not because it's the most technical section. It's not. Not because it introduces a new technique you haven't seen. It doesn't really.
Because it's the most honest thing I've read about what prompting actually is. Written by researchers. In a peer-reviewed paper. And it confirms something most people who work with AI every day already feel but don't have the language for.
Let me tell you what happened.
How Jennifer Aniston’s LolaVie brand grew sales 40% with CTV ads
For its first CTV campaign, Jennifer Aniston’s DTC haircare brand LolaVie had a few non-negotiables. The campaign had to be simple. It had to demonstrate measurable impact. And it had to be full-funnel.
LolaVie used Roku Ads Manager to test and optimize creatives — reaching millions of potential customers at all stages of their purchase journeys. Roku Ads Manager helped the brand convey LolaVie’s playful voice while helping drive omnichannel sales across both ecommerce and retail touchpoints.
The campaign included an Action Ad overlay that let viewers shop directly from their TVs by clicking OK on their Roku remote. This guided them to the website to buy LolaVie products.
Discover how Roku Ads Manager helped LolaVie drive big sales and customer growth with self-serve TV ads.
Here's What Happened
A research team decided to document a real prompt engineering process from start to finish. Not a cleaned-up version. Not a case study written after the fact where the narrative is tidy and the ending is clear.
The actual process. Every step. Every failure. Every accidental discovery. Every moment where something worked and nobody understood why.
The task was detecting "entrapment" in Reddit posts from people experiencing suicidal ideation. Entrapment is the feeling of being trapped with no escape. Identifying it accurately in text has real clinical value.
High stakes. Specific domain. The kind of task where generic prompting fails immediately and the difference between a good prompt and a bad one is not obvious from the outside.
The researcher was not a beginner. This was serious work approached seriously.
Here's what forty-seven steps looked like.
Step one: the prompt returned the wrong format. Zero percent accuracy. Not because the reasoning was wrong. Because the output structure didn't match what the evaluation system expected. An answer engineering problem on the first attempt.
The researcher switched models. GPT-4-1106 kept giving mental health advice instead of classification labels regardless of how the instructions were written. The model's safety training was overriding the task instructions. A guardrail problem nobody anticipated. They moved to GPT-4-32K. Problem solved. Lesson one: sometimes the model is the problem, not the prompt.
They added context. F1 score of 0.40. They added ten examples. F1 score of 0.45. They developed AutoDiCoT — their own technique, invented mid-process, for automatically generating chain-of-thought reasoning from misclassified examples. Scores improved.
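If you want a feel for what that looks like in code, here is a minimal sketch of the AutoDiCoT idea, not the paper's exact procedure: run the current prompt over labeled posts, collect the misclassifications, and ask the model to explain the correct label so those explanations can go back into the prompt as worked examples. The openai client usage is real; the model name, prompt wording, and the classify() helper are placeholders.

```python
# A minimal sketch of the AutoDiCoT idea: for every example the current prompt
# gets wrong, ask the model to explain the *correct* label, then reuse those
# explanations as chain-of-thought demonstrations. Model name, prompt wording,
# and classify() are illustrative, not the paper's code.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; use whatever model you are engineering for


def classify(prompt: str, post: str) -> str:
    """Run the current working prompt on one post and return its label."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{prompt}\n\nPost:\n{post}\n\nLabel:"}],
    )
    return resp.choices[0].message.content.strip().lower()


def auto_dicot(prompt: str, labeled_examples: list[dict]) -> list[dict]:
    """Build chain-of-thought demos from the examples the prompt misclassifies."""
    demos = []
    for ex in labeled_examples:          # each dict: {"post": ..., "label": ...}
        predicted = classify(prompt, ex["post"])
        if predicted == ex["label"]:
            continue                     # only mine the failures
        explanation = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": (
                    f"{prompt}\n\nPost:\n{ex['post']}\n\n"
                    f"The correct label is '{ex['label']}'. "
                    "Explain step by step why that label is correct."
                ),
            }],
        ).choices[0].message.content
        demos.append({"post": ex["post"], "reasoning": explanation, "label": ex["label"]})
    return demos  # paste the best of these back into the prompt as worked examples
```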
Then something happened that I've thought about more than almost anything else in 200 pages.
The researcher accidentally duplicated a paragraph in the prompt. Copy-paste error. Didn't notice immediately. Ran the prompt. Performance improved.
They noticed the duplicate. Assumed it was noise. Removed it. Performance dropped significantly.
They put it back. Performance recovered.
Nobody could explain why. The duplicated paragraph contained context that appeared elsewhere in the prompt. Logically it was redundant. Empirically it was doing something. The researcher tried variations. Removing just the name in the paragraph crashed performance. Anonymizing it hurt results. The paragraph, duplicated, exactly as it appeared by accident, was load-bearing in a way that defied explanation.
After twenty hours and forty-seven documented steps, the best F1 score the human engineer achieved was 0.53.
Then the team ran DSPy. An automated prompt optimization framework. Sixteen iterations. No human decisions about what to change. No theories about what would work. Just systematic testing toward better performance.
DSPy achieved 0.548. Better than twenty hours of expert human work. In sixteen automated iterations.
And here's what DSPy's winning prompt looked like when the researchers examined it.
It used fifteen examples with no chain-of-thought reasoning. The human engineer had concluded that chain-of-thought was essential. DSPy found it wasn't. It made no use of the professor's email that the human engineer had identified as critical context. The human engineer had spent significant effort on that email. DSPy ignored it entirely. It didn't include the "explicit" instruction the human engineer believed was important. DSPy didn't need it.
Everything the human engineer was certain about, the automated system discarded. And performed better.
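For context on what "running DSPy" involves, here is a minimal sketch of an automated optimization pass over a classification task using DSPy's BootstrapFewShot optimizer. The signature, field names, model string, training examples, and metric are illustrative, not the configuration from the case study, and DSPy's API shifts between versions, so treat this as a starting point rather than a recipe.

```python
# A minimal sketch of automated prompt optimization with DSPy. Names and the
# toy metric are placeholders; this is not the case study's setup.
import dspy
from dspy.teleprompt import BootstrapFewShot

# Configure whichever LM client your DSPy version supports.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))


class DetectEntrapment(dspy.Signature):
    """Decide whether a post expresses entrapment (feeling trapped, no escape)."""
    post = dspy.InputField()
    label = dspy.OutputField(desc="'entrapment' or 'no entrapment'")


classifier = dspy.Predict(DetectEntrapment)


def exact_match(example, prediction, trace=None):
    # The optimizer maximizes whatever you measure. In clinical work you would
    # likely weight missed real cases more heavily (see lesson four below).
    return example.label.strip().lower() == prediction.label.strip().lower()


trainset = [
    dspy.Example(post="...", label="entrapment").with_inputs("post"),
    # ... more labeled posts ...
]

optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=15)
compiled = optimizer.compile(classifier, trainset=trainset)

print(compiled(post="I can't see any way out of this.").label)
```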
The Eight Lessons The Researchers Documented
They didn't bury this. They published it. Eight explicit lessons from forty-seven steps.
One. Model guardrails may block your task entirely. Switching models is a legitimate solution, not a workaround.
Two. Looking at individual failing examples reveals patterns that aggregate metrics hide. When your F1 score isn't moving, stop looking at the score and start reading the failures one by one.
Three. Accidental improvements are real. Document everything. Remove nothing without testing first. The thing you're about to delete because it seems redundant might be the thing that's holding your prompt together.
Four. The metric you're optimizing for may not match the actual goal. In clinical work, missing a real case is far worse than a false alarm. Optimizing for overall accuracy hides that asymmetry. Know what kind of error is more costly before you decide what to improve. There's a short code sketch of this just after the list.
Five. The prompt engineer and the domain expert need to work together throughout the process, not just at the beginning. The researcher's best improvements came from conversations with the clinical expert, not from prompt iteration alone.
Six. Automated optimization outperformed twenty hours of expert human work. Not always. In this case, definitively.
Seven. Combining automated and human engineering outperformed either alone. The best results came when both approaches were used together. Not a competition. A collaboration.
Eight. This is the one the researchers put in the published paper verbatim. Prompting "remains a difficult to explain black art."
Not my words. Theirs. Written by the people who spent years reviewing 1,565 papers on the subject.
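Before moving on, lesson four is concrete enough to put in code. A minimal sketch, assuming scikit-learn: if missing a real case costs more than a false alarm, score with a recall-weighted F-beta (beta above 1) instead of plain F1. The toy labels are invented purely to show the calculation.

```python
# If missing a real case is worse than a false alarm, optimize a recall-weighted
# score (here F2) instead of plain F1. The labels below are toy data.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = entrapment present
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]   # two real cases missed, no false alarms

print(f1_score(y_true, y_pred))             # treats both error types equally
print(fbeta_score(y_true, y_pred, beta=2))  # beta > 1 penalizes missed cases more
```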
Why This Matters More Than Any Technique In This Series
Every technique we covered over the past twenty newsletters is real. The research behind each one is solid. The prompts work. I tested them.
But the 47-step case study is the honest context for all of it.
Here's what it's actually telling you.
The techniques are not guarantees. They're starting points with evidence behind them. Few-shot prompting consistently outperforms zero-shot. Chain-of-Thought improves reasoning tasks. Self-Refine catches problems generation misses. These findings are real and reproducible.
But your specific task, your specific model, your specific context will behave in ways the general research didn't capture. A duplicated paragraph will improve performance for no explainable reason. A technique the research says is essential will turn out to be unnecessary for your use case. An automated system will find a better prompt than the one you spent a week building.
The case study doesn't undermine the techniques. It tells you how to hold them. As strong starting points that require testing in your specific context. Not as rules that guarantee results.
The researcher started with the best available knowledge and tested their way to something better. Then an automated system tested its way to something better than that. The progression wasn't a failure. It was the process working correctly.
That's what prompting actually is. Not writing the perfect prompt. Testing toward a better one.
The Prompt That Ties Everything Together
This is the last prompt of the series. It's not a technique. It's a process. The one the 47-step case study demonstrates and the one that sits underneath every technique we covered.
▼ COPY THIS TEMPLATE — THE TESTING LOOP:
Step 1 — Start simple: [Build the minimum viable prompt. Role. Directive. Context. Nothing more than necessary.]
Step 2 — Run it and read the failures: Run the prompt on ten real inputs. Don't look at the aggregate score first. Read the individual failures. For each one, write one sentence describing what went wrong and why.
Step 3 — Generate one targeted fix: Based on the failures, identify the single most common problem. Change one thing in the prompt to address it. One thing. Not three. Not a rewrite. One targeted change.
Step 4 — Test the change: Run the same ten inputs. Compare the failure rate to Step 2. Did the change help? Did it hurt? Did it help on the original failures but create new ones?
Step 5 — Document everything: Keep every version of the prompt. Note what changed and what happened. The thing you remove today might need to come back tomorrow. Document before you delete.
Step 6 — Repeat: Run three to five cycles before concluding the prompt is good. Not satisfied. Good. The difference matters.
Step 7 — Consider automation: If the prompt needs to perform at scale and consistency matters, run an automated variation test using the meta-prompt from Newsletter 15 or the ProTeGi cycle from Newsletter 16. Let the system test things you wouldn't think to try.
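If you want to run this loop on more than intuition, here is a minimal sketch of Steps 2 through 5 as a small harness: one prompt version, ten real inputs, failures read one by one, and every version written to disk before anything gets deleted. The run_prompt() helper, file paths, and exact-match scoring are placeholders for your own task.

```python
# A minimal sketch of the testing loop as a harness: run one prompt version over
# a small set of real inputs, read the failures individually, and document every
# version before changing anything. run_prompt(), paths, and the scoring rule
# are placeholders for your own setup.
import json
from datetime import datetime
from pathlib import Path

LOG_DIR = Path("prompt_versions")
LOG_DIR.mkdir(exist_ok=True)


def run_prompt(prompt: str, item: dict) -> str:
    """Call your model here and return its answer for one input (placeholder)."""
    raise NotImplementedError


def test_version(prompt: str, test_set: list[dict], note: str) -> list[dict]:
    """Steps 2, 4, and 5: run real inputs, collect failures, document the run."""
    failures = []
    for item in test_set:               # each dict: {"input": ..., "expected": ...}
        answer = run_prompt(prompt, item)
        if answer.strip().lower() != item["expected"].strip().lower():
            failures.append({"input": item["input"],
                             "expected": item["expected"],
                             "got": answer})

    record = {
        "timestamp": datetime.now().isoformat(),
        "note": note,                   # what you changed and why (one thing!)
        "prompt": prompt,               # keep every version; deletions are reversible
        "failures": failures,
        "failure_rate": len(failures) / len(test_set),
    }
    path = LOG_DIR / f"run_{record['timestamp'].replace(':', '-')}.json"
    path.write_text(json.dumps(record, indent=2))

    for f in failures:                  # Step 2: read them one by one, not in aggregate
        print(f"FAILED: expected {f['expected']!r}, got {f['got']!r}")
    return failures
```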
What The Entire Series Was Actually About

We started in February with a five-component prompt structure. We end in March with a 47-step case study that shows those components are the beginning of the work, not the end of it.
Everything in between — the techniques, the templates, the research findings — was pointing at the same thing from different angles.
Good prompt work is iterative. It requires testing. It produces accidental discoveries you should document instead of delete.
It benefits from domain expertise you probably don't have alone. It can be improved by automated systems that test without your assumptions. And after all of it, after 1,565 papers and 200 pages and twenty hours of documented expert work, it remains, in the researchers' own words, a difficult to explain black art.
That's not discouraging. That's accurate.
You're not bad at prompting because your prompts need iteration. Everyone's do. You're not missing something obvious because some things work for reasons that aren't obvious. The researchers who wrote the definitive survey of the field said the same.
What separates people who get consistently strong results from AI from people who don't is not knowing a technique the others haven't heard of. It's the habit of testing, documenting, and iterating rather than accepting the first output that looks good enough.
That habit is what this series was trying to give you.
What Comes Next
The series is done. Twenty newsletters. Every major finding from The Prompt Report translated into working prompts you can use today.
But the field isn't done. The Prompt Report covers the research up to its publication date. New techniques are being published. New findings are coming. Models are changing in ways that will break some of what we covered and improve other parts of it.
I'll keep testing. I'll keep sharing what survives real use. That's what this newsletter has always been.
If you've been reading since Newsletter 1, thank you. You stuck with 200 pages of research broken into twenty issues and you showed up for every one. That means something.
If you have a prompt you've been struggling with, reply and send it. I'll look at it.
If you tested something from this series on real work and got a result worth sharing, reply and tell me. I read every one.
The research is done. The prompting isn't.
— Prompt Guy




