Throughput is the new AI benchmark

TL;DR: The interesting part of recursive self-improvement is that it moves the AI conversation from model score to real work throughput. To know whether AI is useful, we should not only ask how it performs on benchmarks. We should ask whether it compresses real work cycles.


Everyone already knows harnesses matter. People have been saying for months that a model cannot just sit in a chat box and answer questions. It needs tools, loops, state, and feedback that changes the next step.

What makes Anthropic’s recursive self-improvement post worth reading is not “model + harness matters.” It is that Anthropic starts giving real metrics.

It does not just say AI improves productivity. It says that inside Anthropic itself, a frontier AI lab, Claude is already part of real engineering and research workflows, and engineer output has risen markedly.

Before, we asked:

How high is the model benchmark?

Now the better question is:

Does this system make an organization move faster?

Not a demo. Not a leaderboard.

Did a bug fix cycle get shorter? Did an experiment loop run faster? Can the same team finish more validated work?

Real work is the new benchmark.

8x Is Not A Benchmark. It Is Organizational Telemetry.

Anthropic’s productivity numbers matter, but not because they are perfect.

They matter because the conversation moves from “AI feels useful” to “we measured how much changed.”

Of course output is not value. More merged code does not automatically mean more value. More experiments do not mean better research taste.

But that does not make the signal useless, because enterprises do not need another model score. They need to know:

After using AI, did the work cycle compress?

Can the same team handle exceptions faster? Close tickets faster? Finish migrations faster? Keep routine work from getting stuck on humans?

That is closer to money than a benchmark.
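One way to make "did the work cycle compress?" concrete is to compare median cycle time before and after adoption. The sketch below is only an illustration: the ticket timestamps are made up, and the function names (`cycle_hours`, `median_cycle_hours`, `compression_ratio`) are my own, not anything from Anthropic's post.

```python
from datetime import datetime
from statistics import median


def cycle_hours(opened: str, closed: str) -> float:
    """Elapsed hours between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(closed) - datetime.fromisoformat(opened)
    return delta.total_seconds() / 3600


def median_cycle_hours(tickets: list[tuple[str, str]]) -> float:
    """Median open-to-close time, in hours, for (opened, closed) pairs."""
    return median(cycle_hours(opened, closed) for opened, closed in tickets)


def compression_ratio(before, after) -> float:
    """How many times shorter the median cycle became (>1 means faster)."""
    return median_cycle_hours(before) / median_cycle_hours(after)


# Hypothetical tickets, invented for this example.
before = [
    ("2025-01-06T09:00", "2025-01-08T09:00"),  # 48 hours
    ("2025-01-07T09:00", "2025-01-08T09:00"),  # 24 hours
    ("2025-01-09T09:00", "2025-01-12T09:00"),  # 72 hours
]
after = [
    ("2025-03-03T09:00", "2025-03-03T21:00"),  # 12 hours
    ("2025-03-04T09:00", "2025-03-05T09:00"),  # 24 hours
    ("2025-03-05T09:00", "2025-03-05T15:00"),  # 6 hours
]

ratio = compression_ratio(before, after)
print(f"median cycle compressed {ratio:.1f}x")  # prints "median cycle compressed 4.0x"
```

The median is used instead of the mean so that one stuck ticket does not dominate the picture. Like any output metric, it says nothing about whether the closed tickets were the right ones.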

Anthropic Is The Ceiling Case

Anthropic is not a normal company. It is an unusually good environment for agents: code-heavy, feedback-rich, AI-native, with mature tooling and people who know how to decompose ambiguous work.

That is not the current state of a traditional enterprise.

A logistics company will not suddenly become Anthropic. A bank will not. A hospital will not. Most SaaS companies will not either.

Traditional companies have legacy systems, permissions, compliance, dirty data, cross-team handoffs, and a lot of reality that has not entered the system.

So I would not read Anthropic’s numbers as “every company gets 8x.”

I would read them as a ceiling signal.

When work is already sufficiently digital, measurable, and tool-accessible, the throughput lift from AI agents can be large.

But if your company cannot even see the work loop, buying a more expensive model does not help much.

You bought intelligence. You are missing the ground where intelligence can work.

The Real Unit Is Model + Harness + Feedback

Which matters more, harness or model?

GPT-3 plus a great harness would not get us here. The model itself has to be strong enough to understand the repo, interpret errors, do local reasoning, and decide whether the next action makes sense.

A strong model sitting in chat is still episodic. It can give you a smart paragraph, but it will not keep driving the kind of long-horizon task Anthropic describes.

The useful unit is the compound system:

model capability
x harness quality
x environment feedback
= effective system agency
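Reading the relation as a literal product is a simplification, but it makes one property easy to see: any factor near zero drags the whole system down. The toy sketch below assumes each factor is scored on a 0 to 1 scale. That scale, the scores, and the name `effective_agency` are illustrative assumptions, not measurements.

```python
def effective_agency(model: float, harness: float, feedback: float) -> float:
    """Toy product of three scores, each assumed to lie in [0, 1]."""
    for name, value in (("model", model), ("harness", harness), ("feedback", feedback)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return model * harness * feedback


# Made-up scores for three hypothetical setups.
strong_model_in_chat = effective_agency(model=0.9, harness=0.1, feedback=0.1)
weak_model_great_harness = effective_agency(model=0.1, harness=0.9, feedback=0.9)
balanced = effective_agency(model=0.7, harness=0.7, feedback=0.7)

print(f"strong model, chat only:   {strong_model_in_chat:.3f}")
print(f"weak model, great harness: {weak_model_great_harness:.3f}")
print(f"balanced system:           {balanced:.3f}")
```

With these invented numbers the balanced system comes out ahead of both lopsided ones, which is the point of treating the three as a compound unit.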

The harness does not compensate for a weak model.

It puts model capability into a loop: understand the goal, decompose it, call tools, observe results, adjust the path, know when to stop, and know when to ask for human judgment.
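The loop described above can be sketched in a few lines. This is a minimal illustration, not how Anthropic's or anyone else's harness actually works: `run_loop`, `LoopResult`, and the stand-in functions are hypothetical, the "model" is a plain callable, and the "tool" is a fake test runner.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    status: str  # "done", "needs_human", or "step_limit"
    history: list[str] = field(default_factory=list)


def run_loop(
    goal: str,
    propose: Callable[[str, list[str]], str],  # stand-in for the model
    act: Callable[[str], str],                 # stand-in for a tool call
    is_done: Callable[[str], bool],
    needs_human: Callable[[str], bool],
    max_steps: int = 5,
) -> LoopResult:
    history: list[str] = []
    for _ in range(max_steps):
        action = propose(goal, history)   # decide the next step from goal + feedback
        observation = act(action)         # call the tool and observe the result
        history.append(f"{action} -> {observation}")
        if is_done(observation):          # know when to stop
            return LoopResult("done", history)
        if needs_human(observation):      # know when to ask for human judgment
            return LoopResult("needs_human", history)
    return LoopResult("step_limit", history)


# Toy stand-ins: the third patch makes the fake test suite pass.
def propose_patch(goal: str, history: list[str]) -> str:
    return f"patch-{len(history) + 1}"


def run_tests(action: str) -> str:
    return "tests pass" if action == "patch-3" else "tests fail"


result = run_loop(
    goal="fix the failing test",
    propose=propose_patch,
    act=run_tests,
    is_done=lambda obs: "pass" in obs,
    needs_human=lambda obs: "permission" in obs,
)
print(result.status, len(result.history))  # prints "done 3"
```

The step limit and the human-escalation check are the parts a chat box does not give you: without them, the loop either never ends or never asks.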

A Normal Company’s Problem Is Not Which Model To Buy

So the enterprise lesson from Anthropic’s post is not “go buy Claude.”

It is to ask a few more annoying questions:

Can your workflow be observed?

Can your tools be called by an agent?

Is success measurable?

Can failure feed back into the system?

Does someone own the loop?

These questions come before model choice.
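The five questions can be written down as a small checklist per workflow. The sketch below is just one way to record the answers. The `LoopReadiness` class and its field names are my own shorthand for the questions above, and the example workflow is invented.

```python
from dataclasses import dataclass, fields


@dataclass
class LoopReadiness:
    workflow_observable: bool   # Can your workflow be observed?
    tools_agent_callable: bool  # Can your tools be called by an agent?
    success_measurable: bool    # Is success measurable?
    failure_feeds_back: bool    # Can failure feed back into the system?
    loop_has_owner: bool        # Does someone own the loop?

    def gaps(self) -> list[str]:
        """Names of the questions currently answered 'no'."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self) -> bool:
        return not self.gaps()


# Hypothetical assessment of an invoice-exception workflow.
invoice_exceptions = LoopReadiness(
    workflow_observable=True,
    tools_agent_callable=False,
    success_measurable=True,
    failure_feeds_back=False,
    loop_has_owner=True,
)
print(invoice_exceptions.ready())  # prints "False"
print(invoice_exceptions.gaps())   # prints "['tools_agent_callable', 'failure_feeds_back']"
```

A list of gaps is more useful than a single score here, because each "no" points at a different piece of groundwork.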

When I wrote “Agents Run The Loop. Humans Face Reality.”, the point was that agents should run loops, and humans should move to the edges: reality, strategy, ethics, emotion, and taste.

This Anthropic post adds the missing first half.

For agents to run loops, the loop has to exist first. Not the workflow in the org chart. The real loop that is observable, executable, and feedback-driven.

Without that, agents just generate more summaries, more plans, and more motion. The company gets louder, not necessarily faster.

Throughput Is The New Feeling

This is how I now prefer to think about AI adoption: do not start by asking how smart the model is. Ask whether work got faster.

Not whether one person feels better using AI. Whether a team, a function, or a company compressed its cycle time.

Before that can happen, the company has to make its work agent-legible.

That is harder than switching models.

And much more real.




This post is licensed under CC BY 4.0 by the author.