Websites Increasingly Tell Apple and AI Companies to Stop Scraping

Wired reported today that many large websites are blocking Applebot-Extended, Apple’s artificial intelligence (AI) web crawler. Wired determined this by examining the sites’ public robots.txt files, which Apple says it respects but some AI companies don’t. According to Wired’s research:
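For context, opting out is a one-line affair. A site blocks Apple’s AI training crawler by adding a directive like this to its robots.txt (a sketch; Applebot-Extended is the user agent Apple documents, and the site-wide `Disallow: /` is just one common choice):

```txt
# Block Apple's AI training crawler site-wide
User-agent: Applebot-Extended
Disallow: /
```

Apple documents Applebot-Extended separately from Applebot, its search crawler, so blocking the former shouldn’t remove a site from Siri and Spotlight search results.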

WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they’ve become a conflict zone over intellectual property and the future of the web.

With the release of Apple Intelligence around the corner, I suppose it makes sense to single out Apple here, but this is not really news. A study in July that Kevin Roose wrote about for The New York Times concluded that websites are blocking web crawlers from all AI companies at a dramatic rate:

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.
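The Robots Exclusion Protocol the study describes is purely voluntary: a crawler fetches robots.txt and decides for itself whether to honor it. Python’s standard library ships a parser that well-behaved bots can use, which makes the mechanism easy to see (a sketch; the robots.txt content and URLs below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that opts out of two AI training
# crawlers while leaving the site open to everything else.
robots_txt = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The named AI crawlers are told to stay out...
print(parser.can_fetch("Applebot-Extended", "https://example.com/article"))
print(parser.can_fetch("GPTBot", "https://example.com/article"))
# ...while ordinary user agents remain allowed.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))
```

Nothing here actually stops a crawler that ignores the file, which is why the server-side blocking mentioned below exists at all.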

These numbers don’t seem to account for websites using server-side methods of blocking crawlers or Cloudflare’s bot-blocking tool, which could mean the decline in available data is underreported.

Still, it’s interesting to see more and more websites evaluate the tradeoffs of allowing AI crawlers to scrape their sites and decide they’re not worth it. However, I wouldn’t be surprised if the media companies that cut deals with OpenAI and others are contractually obligated to block competing crawlers.

I’d also point out that it’s disingenuous of Apple to tell Wired that Applebot-Extended is a way to respect publishers’ rights when the company didn’t offer publishers the chance to opt out until after it had scraped the entire web. Then again, I suppose therein lies the explanation of why so many sites have blocked Applebot-Extended since WWDC.

What’s unclear is how this all shakes out. Big media companies are hedging their bets by making deals with the likes of OpenAI and Perplexity in case Google search continues its decline and is replaced by chatbots. Whether those are good bets or not remains to be seen, but at least they offer some short-term cash flow and referral traffic in what has been a prolonged drought for the media industry.

For websites that don’t make deals or are too small for AI companies to care about, I can see a scenario where some play along anyway, allowing their sites to be scraped for little or no upside. For those sites that choose to stay outside the AI silos, it’s easy to paint a bleak picture, but the Internet is resilient, and I have a feeling that the Open Web will find a path forward in the end.