The benchmark behind the hype.
One original eval a month, measured takes weekly, and footnotes that show the work — written for ML engineers, founders, and analysts who want signal, not threads.
GPT-class inference, 4 models, identical prompt set (n=1,000).
Lower is better · measured 3 days ago
What you'll get
The Benchmark
One original eval per month — reproducible harness, raw numbers, and the prompt set published alongside.
The Margin
Footnotes, caveats, and corrections — where last week's claim didn't hold, and why I changed my mind.
Signal vs Noise
One chart, one claim. The weekly note that turns a viral take into a number you can check.
Issue № 142 · Mar 12, 2026 · 9 min read
The agent eval everyone's quoting is measuring the wrong thing.
"When you score an agent on task completion but not on the cost of the path it took, you reward the model that brute-forces — and you'll deploy the one that bankrupts you."
Cost per resolved task · 4 agent frameworks
From the archive
142 issues · since 2023
№ 141 · Mar 5
RAG is mostly a chunking problem and nobody wants to admit it.
Why retrieval quality plateaus before your model does. · 7 min
№ 140 · Feb 26
The quantization cliff: where 4-bit quietly breaks reasoning.
A 6-model eval across three reasoning suites. · 11 min
№ 139 · Feb 19
Your eval set leaked. Here's how I caught it in 20 minutes.
Contamination checks you can run before publishing. · 8 min
№ 138 · Feb 12
Throughput is a pricing decision, not an engineering one.
Batching economics, with the spreadsheet attached. · 6 min
№ 137 · Feb 5
Long context is a tax. I measured exactly how much.
Attention cost vs. recall across 8k–200k tokens. · 9 min
№ 136 · Jan 29
The fine-tune that lost to a 12-line system prompt.
When adaptation isn't worth the eval debt. · 7 min
"The only AI letter I forward to my whole team. The benchmarks are the ones we'd run ourselves — if we had the time."
"Measured, sourced, and never hype. I've changed two architecture decisions because of the footnotes alone."
"Worth the Pro tier for the datasets alone. I cite these numbers in vendor reviews."
61.4% avg. open rate
18,402 subscribers
142 issues shipped
8.4 min avg. read time
Read free. Go deep for $12.
The essays and the weekly note are free, forever. Pro is for people who want the data, not just the conclusion.
Free
$0
- ✓ The weekly essay + note
- ✓ Signal vs Noise chart
- ✓ Public archive of summaries
Pro
$12/mo
or $120/yr — two months free
- ✓ Everything in Free
- ✓ Full benchmark datasets + harness
- ✓ Model-card deep dives
- ✓ The private analyst thread
Questions, answered
Free vs Pro: what's actually paywalled?
Every essay and the weekly note are free. Pro unlocks the raw benchmark datasets, the reproducible harness, the model-card deep dives, and the analyst thread. You never lose access to free issues.
How often do you send, and will you sell my email?
One issue every Wednesday at 7am ET, plus the monthly benchmark. Your email is never sold, rented, or shared — full stop. One-click unsubscribe in every issue.
Can I expense this through work?
Yes — Pro sends an itemized receipt the moment you subscribe, and there's a team plan if you want five or more seats on one invoice.
Do you publish your benchmark methodology?
Always. Every benchmark ships with the prompt set, the scoring rubric, the model versions and dates, and the harness so you can reproduce it yourself.
Refunds if I don't find it useful?
Cancel any time from a link in the footer of every issue. Email within 30 days of an annual payment and I'll refund it, no questions asked.