AI Models Exposed as Poor Gamblers in Premier League Betting Study

A benchmarking study called KellyBench, run by London-based startup General Reasoning, pitted eight leading AI systems against the 2023-24 Premier League season in a simulated betting environment. Each model received detailed historical match data, team statistics, and rolling player updates, then had to build risk-managed betting strategies across the entire season. Each model had three attempts and no internet access.
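
The benchmark's name appears to nod to the Kelly criterion, a standard formula for sizing bets relative to a bankroll. How the models actually staked their bets is not described here, so the Python sketch below is only an illustration of what a "risk-managed" stake can look like. The function names, the half-Kelly default, and the example numbers are all assumptions, not details from the study.

```python
def kelly_fraction(win_prob: float, decimal_odds: float) -> float:
    """Fraction of bankroll the Kelly criterion suggests for a single bet.

    win_prob     -- the bettor's own estimated probability of winning (0 to 1)
    decimal_odds -- bookmaker decimal odds, e.g. 2.0 pays 1 unit profit per unit staked
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")

    net_odds = decimal_odds - 1.0  # profit per unit staked on a win
    fraction = (net_odds * win_prob - (1.0 - win_prob)) / net_odds
    # A negative value means the bet has no edge, so stake nothing.
    return max(fraction, 0.0)


def stake(bankroll: float, win_prob: float, decimal_odds: float,
          kelly_multiplier: float = 0.5) -> float:
    """Stake size using fractional Kelly (hypothetical half-Kelly default).

    Scaling the full Kelly fraction down is a common way to reduce
    variance when the probability estimate itself is uncertain.
    """
    return bankroll * kelly_multiplier * kelly_fraction(win_prob, decimal_odds)


if __name__ == "__main__":
    # Illustrative numbers only: a 60% estimated win chance at even money.
    print(stake(bankroll=1000.0, win_prob=0.6, decimal_odds=2.0))
```

With these made-up inputs, full Kelly suggests staking about 20 percent of the bankroll and half-Kelly about 10 percent. The formula is only as good as the probability estimate fed into it. An overconfident estimate leads to oversized bets and, over a long season, can lead to bankruptcy.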

Every major model lost money overall. Anthropic’s Claude Opus 4.6 came closest to breaking even, averaging an 11 percent loss with one near-profitable run. xAI’s Grok 4.20 performed worst, going bankrupt in one attempt and failing to finish the other two. Google’s Gemini 3.1 Pro showed wild variance, with a 34 percent gain on one try and bankruptcy on another.

The results highlight a persistent blind spot: AI systems that excel at structured tasks like code generation still struggle with sustained real-world reasoning under uncertainty. Betting markets demand dynamic adaptation to shifting conditions over long time horizons, a capability the tested models did not demonstrate despite their strong performance on narrower benchmarks.