Orthrus claims 7.8× token throughput on Qwen3 with bit-exact output parity

Orthrus is a dual-architecture inference framework for Qwen3 that bolts parallel diffusion-style decoding onto a standard autoregressive LLM without altering its output distribution. The authors report up to 7.8× more tokens per forward pass on generation workloads and roughly 6× end-to-end speedup over the Qwen3-8B baseline on MATH-500, while remaining strictly lossless via an intra-model consensus mechanism that verifies parallel proposals against the base model’s exact predictions.
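The lossless claim rests on verification: a parallel proposal is kept only where it matches what the base model would have produced anyway. The sketch below illustrates that general accept-the-matching-prefix idea under greedy decoding. It is not Orthrus's actual implementation. The toy "base model", the deliberately imperfect proposer, and every function name are hypothetical stand-ins, and the per-token loop is written sequentially for readability where a real system would check all proposals in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextTokenFn = Callable[[Sequence[int]], int]


def toy_base_next_token(tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for the base model's greedy next-token choice."""
    return (3 * tokens[-1] + len(tokens)) % 11


def toy_proposer(tokens: Sequence[int], k: int) -> List[int]:
    """Hypothetical parallel drafter: mostly right, wrong at some positions."""
    seq = list(tokens)
    out: List[int] = []
    for _ in range(k):
        tok = toy_base_next_token(seq)
        if len(seq) % 3 == 0:
            tok = (tok + 1) % 11  # inject a wrong guess on purpose
        out.append(tok)
        seq.append(tok)
    return out


def verify_proposals(
    context: Sequence[int], proposals: Sequence[int], base_next: NextTokenFn
) -> List[int]:
    """Keep proposals while they equal the base model's own prediction.

    At the first mismatch, emit the base model's token and stop, so every
    returned token is exactly what plain autoregressive decoding would emit.
    """
    accepted: List[int] = []
    for tok in proposals:
        expected = base_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the proposal and stop
            return accepted
        accepted.append(tok)
    # Every proposal matched; the verification pass also yields one more token.
    accepted.append(base_next(list(context) + accepted))
    return accepted


def decode_autoregressive(prompt: Sequence[int], n: int) -> List[int]:
    """Baseline: one token per base-model step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(toy_base_next_token(seq))
    return seq


def decode_with_verification(
    prompt: Sequence[int], n: int, k: int
) -> Tuple[List[int], int]:
    """Propose k tokens at a time, verify, and count verification passes."""
    seq = list(prompt)
    target_len = len(prompt) + n
    passes = 0
    while len(seq) < target_len:
        proposals = toy_proposer(seq, k)
        seq.extend(verify_proposals(seq, proposals, toy_base_next_token))
        passes += 1
    return seq[:target_len], passes


if __name__ == "__main__":
    baseline = decode_autoregressive([1], 20)
    verified, passes = decode_with_verification([1], 20, k=4)
    print("identical output:", verified == baseline)
    print("verification passes:", passes, "vs 20 sequential steps")
```

Because a wrong guess is always replaced by the base model's own token, the output matches sequential decoding exactly. The number of tokens gained per pass depends only on how often the proposals are right.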

The design sidesteps two perennial problems with parallel decoding. Unlike speculative decoding approaches such as EAGLE-3 and DFlash, Orthrus has its two views, autoregressive and diffusion, share the same KV cache, which eliminates draft-model memory overhead and raises token acceptance rates as context grows. Unlike pure diffusion LLMs, which tend to drift and lose accuracy on reasoning, Orthrus decouples parallel generation from sequential constraints. Only 16% of parameters are fine-tuned; the base model stays frozen.

Code and checkpoints are published on GitHub with a Hugging Face model (chiennv/Orthrus-Qwen3-8B) and a flash-attention-based runtime; native vLLM and SGLang integrations are flagged as forthcoming. The accompanying paper is on arXiv (2605.12825).