Model benchmarks keep climbing
Benchmark scores continue to rise, but an increasing share of the gains comes from evaluation-specific tuning and prompt scaffolding. Independent replications often show smaller deltas than launch announcements report.
AI capability is accelerating without limit. Each new model is a step change.
Progress may be real but uneven. Headline benchmarks may measure familiarity with test formats as much as general capability.
How much of the reported gain survives contamination-free evaluation? Which capabilities transfer to unscaffolded real-world tasks? What would a saturated benchmark regime look like?
If all public benchmarks were retired tomorrow, how would we know models are improving?
Capital allocation, hiring, and policy all key off a small set of numbers whose meaning is drifting.