Six months of LLM progress: coding agents grew up, laptop models punched above their weight

Simon Willison’s PyCon US 2026 lightning talk frames November 2025 as an inflection point for LLMs, especially for coding. The frontier crown changed hands five times across Claude Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and Claude Opus 4.5, but the more consequential shift was that coding agents finally crossed the threshold from often working to reliably working. OpenAI and Anthropic’s year-long investment in Reinforcement Learning from Verifiable Rewards, paired with the Codex and Claude Code harnesses, made these agents usable as daily drivers rather than novelty toys.

The holiday window that followed produced a wave of over-ambitious experiments, including projects Willison himself has since retired, such as a JavaScript-in-Python port. February brought the breakout of OpenClaw, a personal AI assistant whose category, dubbed “Claws”, drove a run on Mac Minis as people sought local hardware to host them. Willison nods to the obvious risk metaphor: powerful, semi-autonomous assistants are fine until something breaks the guardrails.

The second major theme is the rapid closing of the gap between frontier models and what runs locally. Gemini 3.1 Pro and Google’s animated multi-animal demos suggest the labs are now optimizing for whimsical generative tasks, while open-weight entrants like GLM-5.1 and a 20.9GB Qwen3.6-35B-A3B variant produced output on laptop-class hardware that, by Willison’s pelican benchmark, rivals or beats much larger hosted models. The results are strong enough that he concedes the benchmark itself is approaching obsolescence.