Opting Out of AI Model Training

Dan Moren has an excellent guide on Six Colors that explains how to exclude your website from the web crawlers used by Apple, OpenAI, and others to train large language models for their AI products. For many sites, the process simply requires a few edits to the robots.txt file on your server:

If you’re not familiar with robots.txt, it’s a text file placed at the root of a web server that can give instructions about how automated web crawlers are allowed to interact with your site. This system enables publishers to not only entirely block their sites from crawlers, but also specify just parts of the sites to allow or disallow.

The process is a little more complicated with something like a WordPress, which MacStories uses, and Dan covers that too.

Unfortunately, as Dan explains, editing robots.txt isn’t a solution for companies that ignore the file. It’s simply a convention that doesn’t carry any legal or regulatory weight. Nor does it help with Google or Microsoft’s use of your website’s copyrighted content unless you’re also willing to remove your site from the biggest search engines.

Although I’m glad there is a way to block at least some AI web crawlers prospectively, it’s cold comfort. We and many sites have years of articles that have already been crawled to train these models, and you can’t unring that bell. That said, MacStories’ robot.txt file has been updated to ban Apple and OpenAI’s crawlers, and we’re investigating additional server-level protections.

If you listen to Ruminate or follow my writing on MacStories, you know that I think what these companies are doing is wrong both in the moral and legal sense of the word. However, nothing captures it quite as well as this Mastodon post by Federico today:

If you’ve ever read the principles that guide us at MacStories, I’m sure Federico’s post came as no surprise. We care deeply about the Open Web, but ‘open’ doesn’t give tech companies free rein to appropriate our work to build their products.

Yesterday, Federico linked to Apple’s Machine Learning Research website where it was disclosed that the company has indexed the web to train its model without the consent of publishers. I was as disappointed in Apple as Federico. I also immediately thought of this 2010 clip of Steve Jobs near the end of his life, reflecting on what ‘the intersection of Technology and the Liberal Arts’ meant to Apple:

I’ve always loved that clip. It speaks to me as someone who loves technology and creates things for the web. In hindsight, I also think that Jobs was explaining what he hoped his legacy would be. It’s ironic that he spoke about ‘technology married with Liberal Arts,’ which superficially sounds like what Apple and others have done to create their AI models but couldn’t be further from what he meant. It’s hard to watch that clip now and not wonder if Apple has lost sight of what guided it in 2010.

You can follow all of our WWDC coverage through our WWDC 2024 hub or subscribe to the dedicated WWDC 2024 RSS feed.

MusicHarbor’s Latest Update Creates a Richer Music Experience with News and More

How I Fixed Switching Between Safari Profiles with BetterTouchTool and a Hyper Key

Terminology 5: Rebuilt and Better than Ever

Opting Out of AI Model Training →