RC RANDOM CHAOS

Talkie: a 13B language model trained only on pre-1931 English text

via Simon Willison

Original source: Introducing talkie: a 13B vintage language model from 1930 (Simon Willison)

Nick Levine, David Duvenaud, and Alec Radford have released talkie-1930-13b, a 13B-parameter language model trained on 260B tokens of pre-1931 English text, alongside an instruction-tuned variant for chat. Both checkpoints ship under Apache 2.0. Because the US copyright cutoff currently sits at January 1, 1931, the base model’s corpus is entirely out of copyright, which qualifies it as what Simon Willison calls a ‘vegan model’: one trained only on licensed or public-domain data. The team has signaled that it intends to publish the corpus or reproduction scripts later.

The research framing is the interesting part: the authors want to probe whether a model frozen at a historical knowledge cutoff can predict subsequent events, independently rediscover scientific results (e.g., could a model trained only on text up to 1911 derive General Relativity?), or learn to program from few-shot examples. Keeping post-1930 text out of the training data is treated as a primary engineering challenge.

The chat variant breaks the ‘vegan’ purity, however. Instruction-following was bootstrapped using Claude Sonnet 4.6 as a judge for DPO (Direct Preference Optimization) and Claude Opus 4.6 as a synthetic conversation partner, and that process imports anachronistic behaviors: an earlier 7B version reportedly started replying in listicles. The team’s stated goal is to eventually use vintage base models as their own judges to close that loop. Willison’s pelican-on-a-bicycle SVG test, predictably, returned a period-flavored prose anecdote rather than vector graphics.

This is an AI-generated summary. Read the original for the full story.