Issue № 142 · 214 joined this week

The benchmark behind the hype.

One original eval a month, measured takes weekly, and footnotes that show the work — written for ML engineers, founders, and analysts who want signal, not threads.

Read by people at

Anthropic · Ramp · Modal · Hugging Face

Latency p50 · benchmark live

GPT-class inference, 4 models, identical prompt set (n=1,000).

Lower is better · measured 3 days ago

What you'll get

The Benchmark

One original eval per month — reproducible harness, raw numbers, and the prompt set published alongside.

The Margin

Footnotes, caveats, and corrections — where last week's claim didn't hold, and why I changed my mind.

Signal vs Noise

One chart, one claim. The weekly note that turns a viral take into a number you can check.

Issue № 142 · Mar 12, 2026 · 9 min read

The agent eval everyone's quoting is measuring the wrong thing.

"When you score an agent on task completion but not on the cost of the path it took, you reward the model that brute-forces — and you'll deploy the one that bankrupts you."

Cost per resolved task · 4 agent frameworks
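To make the distinction concrete, here is a minimal sketch of the two scores side by side. It is not the harness from the issue: the `Run` record, both function names, and every number below are hypothetical and chosen only to illustrate the claim.

```python
from dataclasses import dataclass


@dataclass
class Run:
    resolved: bool   # did the agent finish the task?
    cost_usd: float  # total spend on the path it took (hypothetical unit)


def completion_rate(runs):
    """Share of runs that resolved the task; ignores cost entirely."""
    return sum(r.resolved for r in runs) / len(runs)


def cost_per_resolved_task(runs):
    """Total spend, failed runs included, divided by resolved tasks."""
    resolved = sum(r.resolved for r in runs)
    if resolved == 0:
        return float("inf")  # nothing resolved, so any spend is wasted
    return sum(r.cost_usd for r in runs) / resolved


# Made-up runs: one agent brute-forces, the other takes cheaper paths.
brute_force = [Run(True, 4.0)] * 9 + [Run(False, 4.0)]
frugal = [Run(True, 0.5)] * 8 + [Run(False, 0.5)] * 2

for name, runs in [("brute_force", brute_force), ("frugal", frugal)]:
    print(name, completion_rate(runs), round(cost_per_resolved_task(runs), 3))
```

On these invented numbers the brute-force agent wins on completion rate while costing several times more per resolved task, which is the gap a completion-only score hides.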

Read the full issue →

From the archive

142 issues · since 2023

№ 141 · Mar 5

RAG is mostly a chunking problem and nobody wants to admit it.

Why retrieval quality plateaus before your model does. · 7 min

№ 140 · Feb 26

The quantization cliff: where 4-bit quietly breaks reasoning.

A 6-model eval across three reasoning suites. · 11 min

№ 139 · Feb 19

Your eval set leaked. Here's how I caught it in 20 minutes.

Contamination checks you can run before publishing. · 8 min

№ 138 · Feb 12

Throughput is a pricing decision, not an engineering one.

Batching economics, with the spreadsheet attached. · 6 min

№ 137 · Feb 5

Long context is a tax. I measured exactly how much.

Attention cost vs. recall across 8k–200k tokens. · 9 min

№ 136 · Jan 29

The fine-tune that lost to a 12-line system prompt.

When adaptation isn't worth the eval debt. · 7 min

Daniel Köhler

Former ML infra lead at a frontier lab · built eval harnesses used by 40+ teams.

I started Signal Theory because the discourse moves at the speed of screenshots and the truth moves at the speed of a reproducible run. Every claim here ships with the numbers behind it — and the corrections when I'm wrong.

"The only AI letter I forward to my whole team. The benchmarks are the ones we'd run ourselves — if we had the time."

Priya Anand · Staff Engineer at Modal

"Measured, sourced, and never hype. I've changed two architecture decisions because of the footnotes alone."

Marcus Tobin · Founding Engineer at Ramp

"Worth the Pro tier for the datasets alone. I cite these numbers in vendor reviews."

Hana Suzuki · ML Lead at Hugging Face

61.4%

avg. open rate

18,402

subscribers

142

issues shipped

8.4 min

avg. read time

Modal · Ramp · Hugging Face · Vercel · Linear · Replicate · Baseten · Cohere

Read free. Go deep for $12.

The essays and the weekly note are free, forever. Pro is for people who want the data, not just the conclusion.

Free

✓ The weekly essay + note
✓ Signal vs Noise chart
✓ Public archive of summaries

Subscribe free

Reader-supported

Pro

$12/mo

or $120/yr — two months free

✓ Everything in Free
✓ Full benchmark datasets + harness
✓ Model-card deep dives
✓ The private analyst thread

Start Pro

Questions, answered

Free vs Pro — what's actually paywalled?

Every essay and the weekly note are free. Pro unlocks the raw benchmark datasets, the reproducible harness, the model-card deep dives, and the analyst thread. You never lose access to free issues.

How often do you send, and will you sell my email?

One issue every Wednesday at 7am ET, plus the monthly benchmark. Your email is never sold, rented, or shared — full stop. One-click unsubscribe in every issue.

Can I expense this through work?

Yes — Pro sends an itemized receipt the moment you subscribe, and there's a team plan if you want five or more seats on one invoice.

Do you publish your benchmark methodology?

Always. Every benchmark ships with the prompt set, the scoring rubric, the model versions and dates, and the harness so you can reproduce it yourself.

Refunds if I don't find it useful?

Cancel any time from a link in the footer of every issue. Email within 30 days of an annual payment and I'll refund it, no questions asked.
