whichllm: CLI picks the best local LLM for your GPU using real benchmarks

whichllm is a command-line tool that auto-detects a machine’s GPU, CPU, and RAM and ranks HuggingFace models that will actually run on it. Unlike size-based fit checkers, it sorts candidates by merged scores from LiveBench, Artificial Analysis, Aider, Chatbot Arena, and the Open LLM Leaderboard, so a newer 27B model can outrank a 32B that also fits. VRAM estimates account for weights, GQA KV cache, activations, and overhead, while throughput modeling factors in quantization, backend, MoE active-vs-total parameters, and unified-memory partial offload.
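To make the VRAM accounting above concrete, here is a minimal sketch of a weights + KV cache + activations + overhead estimate. It is not whichllm's code: the function names, the fixed activation and overhead allowances, and the example model shape are all assumptions chosen for illustration. Only the KV cache formula (two tensors per layer, sized by the number of KV heads rather than attention heads under GQA) is standard.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V caches for one sequence.

    With grouped-query attention (GQA), n_kv_heads is smaller than the
    number of attention heads, which shrinks the cache proportionally.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimate_vram_gib(params_b: float, bits_per_weight: float,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      context_len: int,
                      activation_gib: float = 0.5,  # assumed flat allowance
                      overhead_gib: float = 1.0     # assumed runtime overhead
                      ) -> float:
    """Rough VRAM estimate in GiB: weights + KV cache + activations + overhead."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    cache_bytes = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len)
    return (weight_bytes + cache_bytes) / 2**30 + activation_gib + overhead_gib


# Hypothetical 8B model: 32 layers, 8 KV heads, head_dim 128,
# 8192-token context, about 4.5 bits per weight after quantization.
print(f"{estimate_vram_gib(8.0, 4.5, 32, 8, 128, 8192):.2f} GiB")
```

Under these assumed constants the hypothetical model needs roughly 6.7 GiB, of which exactly 1 GiB is KV cache. A real estimator would presumably vary the activation and overhead terms by backend and batch size.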

The ranking pipeline grades every benchmark score by evidence quality (direct match, variant, base model, interpolated, or uploader-reported) and discounts it accordingly. Cross-family inheritance is rejected when parameter counts diverge by more than 2x, which blocks small forks from inheriting their base model’s scores. Stale leaderboards are demoted along each model’s lineage so that 2024 results cannot dominate current-generation ones, and the snapshot date is printed with every recommendation.
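The evidence grading and the 2x parameter guard can be sketched as follows. The discount factors here are invented for illustration, since the source does not give whichllm's actual weights, and the names `EVIDENCE_WEIGHT`, `can_inherit`, and `discounted_score` are hypothetical. As a simplification, the sketch applies the guard to every non-direct score rather than only to cross-family inheritance.

```python
from typing import Optional

# Hypothetical discount per evidence grade (assumed values, not whichllm's).
EVIDENCE_WEIGHT = {
    "direct": 1.0,
    "variant": 0.9,
    "base_model": 0.8,
    "interpolated": 0.6,
    "uploader_reported": 0.4,
}

MAX_PARAM_RATIO = 2.0  # reject inheritance beyond a 2x size gap


def can_inherit(model_params_b: float, source_params_b: float) -> bool:
    """True if the two parameter counts are within 2x of each other."""
    larger = max(model_params_b, source_params_b)
    smaller = min(model_params_b, source_params_b)
    return larger / smaller <= MAX_PARAM_RATIO


def discounted_score(raw_score: float, evidence: str,
                     model_params_b: float,
                     source_params_b: float) -> Optional[float]:
    """Return the discounted score, or None if inheritance is rejected."""
    if evidence != "direct" and not can_inherit(model_params_b, source_params_b):
        return None
    return raw_score * EVIDENCE_WEIGHT[evidence]


# A 1.5B fork cannot inherit a 7B base model's score; an 8B variant can.
print(discounted_score(80.0, "base_model", 1.5, 7.0))
print(discounted_score(80.0, "variant", 8.0, 7.0))
```

Returning `None` rather than a heavily discounted number keeps a rejected score out of the merge entirely, which matches the described intent of blocking small forks from riding on a larger model's results.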

Beyond ranking, whichllm can simulate arbitrary GPUs for purchase planning, look up the hardware a specific model needs, download and chat with the top pick via uv-managed environments, emit ready-to-run Python snippets, or stream JSON for shell pipelines feeding Ollama. It ships via uvx, Homebrew, and pip, and supports GGUF, AWQ, GPTQ, and FP16/BF16 backends.