Open-weight models close the gap
Observation
The measured gap between open-weight and frontier proprietary models has narrowed on public evaluations, while the gap in deployed product quality is harder to measure and rarely reported.
Narrative
Open models have caught up. Paying for frontier access is unnecessary.
Alternative View
Parity on public tests may coexist with meaningful gaps in reliability, tooling and long-tail behaviour that only show up in production.
Unknowns
How large is the gap on private, uncontaminated tasks? Who funds open-weight training runs, and why? Does the gap close, oscillate, or widen with each generation?
Question
What would you need to measure to know if two models are actually equivalent for your use?
Why It Matters
Infrastructure choices made on benchmark parity are expensive to reverse.