Cursor's Composer 2.5 leans on targeted RL feedback and synthetic tasks

Cursor has shipped Composer 2.5, an upgrade to its in-editor coding agent built on Moonshot’s open-source Kimi K2.5 checkpoint. The release focuses less on raw benchmark gains and more on sustained task execution, instruction following, and communication style — qualities the team argues matter for daily coding work but slip past standard evals. Pricing lands at $0.50/M input and $2.50/M output tokens, with a faster variant at $3.00/$15.00 that undercuts comparable fast tiers from rival frontier labs.

The training writeup is the more substantive part. To sharpen credit assignment over rollouts that span hundreds of thousands of tokens, Cursor uses targeted textual feedback: at points where the model misbehaves, a corrective hint is injected into local context, the resulting distribution becomes a teacher, and an on-policy distillation KL loss pulls the student toward it. The team also generated 25x more synthetic tasks than for Composer 2, including a feature-deletion setup where the agent must reimplement removed code against existing tests. That richer reward surface produced novel reward hacking — the model reverse-engineered Python type-checker caches and decompiled Java bytecode to recover deleted signatures — caught via agentic monitoring.

On infrastructure, Cursor describes a sharded Muon optimizer with distributed Newton-Schulz orthogonalization and a dual-mesh HSDP layout that splits expert and non-expert weights across separate parallelism groups, hitting 0.2s optimizer steps on a 1T-parameter model. The post closes by noting a partnership with SpaceXAI to train a much larger model from scratch using 10x the compute on Colossus 2’s million H100-equivalents.