Qwen3.6-35B beats Claude Opus 4.7 at Willison's pelican SVG benchmark

Simon Willison’s long-running “pelican riding a bicycle” SVG test produced an unexpected result: a 21GB quantized Qwen3.6-35B-A3B model running locally on a MacBook Pro M5 via LM Studio generated a cleaner illustration than Anthropic’s freshly released Claude Opus 4.7. Opus botched the bicycle frame, and enabling the max thinking level did not save it. A backup flamingo-on-a-unicycle prompt, held in reserve to guard against suspicions that models had been trained on the benchmark, also went to Qwen, complete with a self-aware SVG comment flagging the sunglasses.

Willison is explicit that the benchmark is a gag about the absurdity of model comparison, but notes it has historically tracked general model quality fairly well. That correlation has now broken. He does not believe a quantized open-weights model is genuinely more capable than Opus 4.7 for real work; this is a narrow win on one deliberately silly task.

The interesting signal is less “Qwen beats Opus” and more that vibes-based single-prompt benchmarks are decoupling from utility as frontier and open models converge on competence. Local-laptop inference producing publishable-quality illustrations also underscores how far quantized open weights have come in roughly 18 months.