What N tokens/second actually feels like: an interactive demo
Local-LLM benchmarks throw around throughput numbers (47 tok/s on an M3, 500 tok/s on Groq), but raw figures don’t convey what streaming output at those rates actually looks like. This browser tool renders simulated token streams at adjustable speeds in four modes: syntax-highlighted code, prose, reasoning-model think-aloud, and agent output that mixes tool calls with code generation.
The payoff is perceptual rather than analytical. Reading at 5 tok/s feels like a local model running on a Raspberry Pi; 60 tok/s matches typical hosted Claude or GPT output; 200 tok/s is Groq territory; 800 tok/s is faster than the eye can track. Toggling between code and prose at identical rates exposes the gap the tool was built to demonstrate: code is far more token-dense than English, so the same throughput number feels dramatically different depending on content.
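To put those rates on a common scale, here is a minimal sketch of the pacing arithmetic, assuming tokens are emitted at a perfectly even interval. The function names and the 300-token response length are illustrative assumptions, not taken from the demo.

```python
def seconds_to_stream(num_tokens: int, tokens_per_second: float) -> float:
    """Total time to emit num_tokens at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def emission_times(num_tokens: int, tokens_per_second: float) -> list[float]:
    """Time offset in seconds at which each token would appear on screen."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    interval = 1.0 / tokens_per_second  # evenly spaced tokens (an assumption)
    return [i * interval for i in range(num_tokens)]


if __name__ == "__main__":
    # Hypothetical 300-token response at the four rates mentioned in the text.
    for rate in (5, 60, 200, 800):
        print(f"{rate:>4} tok/s -> {seconds_to_stream(300, rate):.3f} s")
```

Under these assumptions the same 300 tokens take a full minute at 5 tok/s and well under half a second at 800 tok/s. Real streams are burstier than this even schedule suggests.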
The demo uses a BPE-style approximation rather than any vendor tokenizer, since tiktoken, Claude’s encoder, and others disagree on specifics anyway. Long identifiers split into multiple tokens, punctuation marks count as tokens, and English prose averages around 1.3 tokens per word, so 30 tok/s translates to roughly 23 words per second of readable output.
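The conversion above, together with a rough token estimator, can be sketched as follows. This is an illustrative heuristic, not the demo's actual algorithm: the 4-characters-per-token chunking, the regular expression, and the function names are all assumptions chosen to mimic the behavior described (long identifiers split, each punctuation mark counted).

```python
import re

TOKENS_PER_WORD = 1.3  # average for English prose quoted in the text

# Runs of word characters, or single non-space punctuation characters.
_PIECE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def words_per_second(tokens_per_second: float,
                     tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a token rate into an approximate prose reading rate."""
    return tokens_per_second / tokens_per_word


def approx_token_count(text: str, chars_per_token: int = 4) -> int:
    """Crude token estimate; chars_per_token is an assumed chunk size."""
    count = 0
    for piece in _PIECE.findall(text):
        if piece[0].isalnum() or piece[0] == "_":
            # Ceiling division: a long identifier splits into several tokens.
            count += -(-len(piece) // chars_per_token)
        else:
            count += 1  # each punctuation character counts as one token
    return count


if __name__ == "__main__":
    print(round(words_per_second(30), 1))
    print(approx_token_count("the cat"))
    print(approx_token_count("get_user_profile_by_id(x)"))
```

With this heuristic a short line of code can cost several times more tokens than a two-word phrase, which is the density gap the demo illustrates. A real tokenizer will give different counts.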
This is an AI-generated summary. Read the original for the full story.