Orthrus claims up to 7.8× tokens per forward pass on Qwen3 with an identical output distribution
Original source
Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
Orthrus is a dual-architecture inference framework for Qwen3 that adds parallel, diffusion-style decoding to a standard autoregressive LLM without altering its output distribution. The authors report up to 7.8× more tokens per forward pass on generation workloads and roughly 6× end-to-end speedup over the Qwen3-8B baseline on MATH-500. Output remains strictly lossless because an intra-model consensus mechanism verifies parallel proposals against the base model’s exact predictions.
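To make the verification idea concrete, here is a minimal sketch of prefix verification for the greedy-decoding case only. It is not the Orthrus implementation: `verify_proposals` and `toy_base_next_tokens` are hypothetical names, and the toy model is a stand-in for a single forward pass of the base model that scores every proposed position at once. How Orthrus handles sampling, and how its consensus mechanism is actually built, is not described in this summary.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: given the context and k proposed tokens, return k + 1
# greedy predictions, where prediction i is the base model's next token after
# context + proposals[:i]. In a real system this would be one batched forward pass.
BaseModel = Callable[[Sequence[int], Sequence[int]], List[int]]


def verify_proposals(
    context: Sequence[int],
    proposals: Sequence[int],
    base_next_tokens: BaseModel,
) -> List[int]:
    """Return the tokens to commit: the verified prefix plus one base-model token."""
    targets = base_next_tokens(context, proposals)
    accepted: List[int] = []
    for proposed, target in zip(proposals, targets):
        if proposed != target:
            break  # first disagreement: discard this and all later proposals
        accepted.append(proposed)
    # The base model's own prediction at the first unverified position is always
    # safe to commit, so each pass yields at least one token.
    accepted.append(targets[len(accepted)])
    return accepted


def toy_base_next_tokens(context: Sequence[int], proposals: Sequence[int]) -> List[int]:
    """Toy stand-in for the base model: the next token is (last token + 1) mod 10."""
    predictions = []
    for i in range(len(proposals) + 1):
        prefix = list(context) + list(proposals[:i])
        predictions.append((prefix[-1] + 1) % 10)
    return predictions


if __name__ == "__main__":
    # The third proposal (9) disagrees with the toy model, which predicts 6.
    print(verify_proposals([1, 2, 3], [4, 5, 9, 7], toy_base_next_tokens))
```

Because every committed token equals what the base model would have produced one step at a time, the output of this greedy sketch matches plain autoregressive decoding; the speedup comes only from how many proposals survive each pass.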
The design sidesteps two perennial problems with parallel decoding. Unlike speculative decoding approaches such as EAGLE-3 and DFlash, Orthrus has both views share the same KV cache, which eliminates draft-model memory overhead and raises token acceptance rates as context grows. Unlike pure diffusion LLMs, which tend to drift and lose accuracy on reasoning, Orthrus decouples parallel generation from sequential constraints. Only 16% of parameters are fine-tuned; the base model stays frozen.
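The link between acceptance rate and tokens per forward pass can be illustrated with the standard back-of-envelope model used for speculative decoding. This is an assumption for illustration, not the metric definition from the Orthrus paper: it treats each proposed token as accepted independently with probability `alpha`, with `k` proposals per pass and one guaranteed base-model token. The function name is hypothetical.

```python
def expected_tokens_per_forward(alpha: float, k: int) -> float:
    """Expected tokens committed per verification pass.

    Assumes (for illustration only) that each of the k proposals is accepted
    independently with probability alpha, verification stops at the first
    rejection, and one base-model token is always committed.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every proposal accepted, plus the guaranteed token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_forward(alpha, k=8), 2))
```

Under this simplified model, small gains in acceptance rate near 1.0 translate into large gains in tokens per pass, which is why an acceptance rate that rises with context length matters.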
Code and checkpoints are published on GitHub with a Hugging Face model (chiennv/Orthrus-Qwen3-8B) and a flash-attention-based runtime; native vLLM and SGLang integrations are flagged as forthcoming. The accompanying paper is on arXiv (2605.12825).
This is an AI-generated summary. Read the original for the full story.