Model benchmarks keep climbing

AI · 09 JUN 2026

Observation

Benchmark scores continue to rise, but a growing share of the gains comes from evaluation-specific tuning and prompt scaffolding. Independent replications often show smaller deltas than launch announcements report.

Narrative

AI capability is accelerating without limit. Each new model is a step change.

Alternative View

Progress may be real but uneven. Headline benchmarks may measure familiarity with test formats as much as general capability.

Unknowns

How much of the reported gain survives contamination-free evaluation? Which capabilities transfer to unscaffolded real-world tasks? What would a saturated benchmark regime look like?
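
One rough way to make the first question concrete is to score a model twice: once on the full benchmark and once on only those items that share no long word sequence with a reference corpus. The sketch below is illustrative only. The function names, the whitespace tokenization, and the n-gram overlap rule are all assumptions made for this example, not a description of how any published evaluation detects contamination.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase, whitespace-delimited n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(item: str, corpus_docs: list[str], n: int = 8) -> bool:
    """Flag an item if it shares any n-gram with any corpus document.

    Assumption: exact n-gram overlap is a crude proxy for contamination.
    It misses paraphrases and can flag common boilerplate phrases.
    """
    item_grams = ngrams(item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)


def accuracy(results: list[bool]) -> float:
    """Fraction of correct answers, or NaN when there are no results."""
    return sum(results) / len(results) if results else float("nan")


def clean_vs_full(
    items: list[str],
    correct: list[bool],
    corpus_docs: list[str],
    n: int = 8,
) -> tuple[float, float]:
    """Return (accuracy on all items, accuracy on unflagged items only)."""
    clean = [
        ok
        for item, ok in zip(items, correct)
        if not is_contaminated(item, corpus_docs, n)
    ]
    return accuracy(correct), accuracy(clean)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    items = [
        "what is the capital of france please answer now ok",
        "compute two plus two and report the result here",
    ]
    correct = [True, False]
    corpus = ["someone asked what is the capital of france yesterday"]
    full, clean = clean_vs_full(items, correct, corpus, n=3)
    print(f"full={full:.2f} clean={clean:.2f}")
```

A wide gap between the two numbers would suggest that part of the headline score depends on items the model may have seen before. A narrow gap is weaker evidence, since overlap checks of this kind do not catch paraphrased or translated test items.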

Question

If all public benchmarks were retired tomorrow, how would we know models are improving?

Why It Matters

Capital allocation, hiring, and policy all key off a small set of numbers whose meaning is drifting.