OpenAI's voice AI stack: how they squeeze latency out of real-time speech

OpenAI's engineering write-up walks through the infrastructure behind its real-time voice models, with a focus on keeping end-to-end latency low enough for natural conversation. The pipeline collapses the traditional ASR-LLM-TTS chain into a unified speech-to-speech model, removing the cost of transcribing audio, generating text, and then re-synthesizing speech as three separate stages. The main levers are streaming inference, GPU scheduling, and network path optimization.

At scale, the bottleneck shifts from raw model speed to tail latency under concurrent load. OpenAI describes per-region GPU pools, custom routing to minimize geographic hops, and aggressive batching strategies that preserve sub-second time-to-first-audio while keeping utilization high. Audio chunking and incremental decoding let the model start speaking before a full response is generated.
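To make the last point concrete, here is a minimal sketch of why incremental decoding lowers time-to-first-audio. It is not OpenAI's implementation: the names (`Chunk`, `decode_incrementally`, `time_to_first_audio`, `count_underruns`) and all timing values are hypothetical, and the clock is simulated rather than measured. The assumption is that each chunk takes a fixed time to decode and carries a fixed amount of audio.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Chunk:
    index: int
    ready_at_ms: float   # simulated time at which this chunk finishes decoding
    duration_ms: float   # amount of audio the chunk contains


def decode_incrementally(num_chunks: int, decode_ms: float,
                         duration_ms: float) -> Iterator[Chunk]:
    """Yield each chunk as soon as it is decoded (simulated clock, no real model)."""
    clock = 0.0
    for i in range(num_chunks):
        clock += decode_ms
        yield Chunk(i, clock, duration_ms)


def time_to_first_audio(chunks: List[Chunk], streaming: bool) -> float:
    """Streaming playback starts at the first chunk; otherwise it waits for the last."""
    return chunks[0].ready_at_ms if streaming else chunks[-1].ready_at_ms


def count_underruns(chunks: List[Chunk]) -> int:
    """Count chunks that are not yet decoded when the player needs them."""
    play_at = chunks[0].ready_at_ms  # playback begins with the first chunk
    underruns = 0
    for chunk in chunks:
        if chunk.ready_at_ms > play_at:
            underruns += 1
            play_at = chunk.ready_at_ms  # player stalls until the chunk arrives
        play_at += chunk.duration_ms
    return underruns


if __name__ == "__main__":
    # Hypothetical numbers: 10 chunks, 40 ms to decode each, 100 ms of audio each.
    chunks = list(decode_incrementally(10, decode_ms=40.0, duration_ms=100.0))
    print("streaming TTFA (ms):", time_to_first_audio(chunks, streaming=True))
    print("full-response TTFA (ms):", time_to_first_audio(chunks, streaming=False))
    print("underruns:", count_underruns(chunks))
```

In this toy model, streaming only works smoothly while each chunk decodes faster than it plays back. Once `decode_ms` exceeds `duration_ms`, `count_underruns` starts reporting stalls, which is one reason tail latency under concurrent load matters more than average model speed.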

The significance is less about a single technical trick and more about voice AI maturing into production infrastructure. Real-time multimodal models are now an operations problem (capacity planning, regional failover, and latency SLOs), not just a research artifact. Competitors building voice agents will face the same tradeoffs around batching, routing, and the cost of running speech-native models at conversational pace.