RANDOM CHAOS

OpenAI retires SWE-bench Verified as a frontier coding benchmark

via Hacker News

Original source

SWE-bench Verified no longer measures frontier coding capabilities

OpenAI says SWE-bench Verified has saturated as a measure of frontier coding ability. Top models now cluster near the ceiling, leaving little headroom to distinguish capability gains between successive releases. It is a familiar pattern once models outgrow the benchmark that was built to stress them.
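To make the "little headroom" point concrete, here is a rough sketch of how sampling noise compares with a small score gap. It is not taken from OpenAI's post. It assumes each task is an independent pass/fail trial and treats the two models as unpaired samples, which is a simplification. The function names and the pass rates in the usage note are hypothetical; the task count of 500 is the published size of SWE-bench Verified.

```python
import math

# Assumption: the benchmark is scored as independent pass/fail tasks.
NUM_TASKS = 500  # published size of SWE-bench Verified


def pass_rate_standard_error(pass_rate: float, num_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on num_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / num_tasks)


def gap_is_distinguishable(rate_a: float, rate_b: float,
                           num_tasks: int = NUM_TASKS,
                           z: float = 1.96) -> bool:
    """Return True if the gap between two pass rates exceeds z standard
    errors of their difference (unpaired normal approximation)."""
    se_a = pass_rate_standard_error(rate_a, num_tasks)
    se_b = pass_rate_standard_error(rate_b, num_tasks)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(rate_a - rate_b) > z * se_diff
```

Under these assumptions, a hypothetical two-point gap such as gap_is_distinguishable(0.93, 0.95) comes back False, while a ten-point gap in the middle of the range, gap_is_distinguishable(0.50, 0.60), comes back True. Once scores cluster near the top, the remaining gaps are small enough to sit inside the noise.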

The shift matters because SWE-bench Verified has been treated as a load-bearing signal for agentic coding progress across the industry. When the leaderboard compresses, score deltas stop tracking real-world engineering competence, and vendors lean harder on cherry-picked tasks or bespoke harnesses to differentiate. Expect a pivot toward longer-horizon, multi-repo, and production-grade evaluations that exercise planning, tool use, and recovery from failure rather than single-issue patch generation.

This is an AI-generated summary. Read the original for the full story.