340+ Local News Sites Block Internet Archive Over AI Scraping Fears

More than 340 local US news outlets have added the Internet Archive’s crawlers to their robots.txt blocklists, up from 241 sites in January, according to a new Nieman Lab analysis. The wave is driven primarily by five of the country’s seven largest local news chains: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing, the last two owned by hedge fund Alden Global Capital. Publishers say they’re guarding their intellectual property against scraping for AI training, though none have produced evidence that an AI company actually harvested their content through the Wayback Machine.

The blocked bots include Heritrix, Archive-It, and archive.org_bot, among others. Wayback Machine director Mark Graham points to rate-limiting and Cloudflare-based bot monitoring as guardrails against bulk harvesting, and notes that the Archive’s terms of use restrict use of its collections to scholarship and research. Advance Local confirmed it preemptively hard-blocked the Archive last August as part of a broader anti-scraping policy, not in response to any specific incident.
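
For readers who want to see what such a block looks like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It checks which of the crawler names mentioned above a given robots.txt turns away. The sample robots.txt is hypothetical and not copied from any publisher, the helper name `blocked_agents` is made up for illustration, and the exact user-agent tokens a real site lists may differ from the three used here.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the article; the exact user-agent tokens
# that a given site lists may differ (assumption).
ARCHIVE_AGENTS = ["heritrix", "Archive-It", "archive.org_bot"]

# Hypothetical robots.txt for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
User-agent: heritrix
Disallow: /

User-agent: *
Disallow: /admin/
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines, so no
    # network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT, ARCHIVE_AGENTS))
```

To audit a live site, the same parser can be pointed at the site's robots.txt with `set_url()` followed by `read()` instead of `parse()`. Note that robots.txt is advisory: it only keeps out crawlers that choose to honor it.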

The collateral damage falls on journalists, historians, and researchers who depend on the Wayback Machine to reconstruct reporting from defunct or gutted local papers — particularly in news deserts where archived stories may be the only surviving record. Critics frame the moment as the latest round in a long-running fight between an information-should-be-free institution and rights-holders, now accelerated by the economics of generative AI.