Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.
If you read MacStories regularly, or listen to our podcasts, you already know that Federico and I think that crawling the Open Web to train large language models is unethical. Industry-wide, AI companies have scraped the content of websites like ours, using it as the raw material for their chatbots and other commercial products without the consent or compensation of publishers and other creators.
Now that the horse is out of the barn, some of those companies are respecting publishers’ robots.txt files, while others seemingly aren’t. That doesn’t make up for the tens of thousands of articles and images that have already been scraped from MacStories. Nor is robots.txt a complete solution, so it’s just one of four approaches we’re taking to protect our work.
Dan Moren has an excellent guide on Six Colors that explains how to exclude your website from the web crawlers used by Apple, OpenAI, and others to train large language models for their AI products. For many sites, the process simply requires a few edits to the robots.txt file on your server:
If you’re not familiar with robots.txt, it’s a text file placed at the root of a web server that gives automated web crawlers instructions about how they’re allowed to interact with your site. The system lets publishers block their sites from crawlers entirely or allow and disallow access to specific parts of a site.
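As a concrete illustration, a robots.txt that opts an entire site out of AI training by Apple and OpenAI might look like the snippet below. The user-agent tokens are the ones the two companies document for this purpose, but they’re worth verifying against each company’s current documentation before relying on them:

```text
# Opt out of OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Opt out of Apple's AI training
# (Applebot-Extended isn't a separate crawler; it's a token
# Applebot checks to decide whether content may be used for training)
User-agent: Applebot-Extended
Disallow: /
```

Each `User-agent` line names a crawler, and the `Disallow: /` beneath it excludes that crawler from every path on the site; replacing `/` with a narrower path would exclude only that section.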
The process is a little more complicated with a CMS like WordPress, which MacStories uses, and Dan covers that too.
Unfortunately, as Dan explains, editing robots.txt isn’t a solution for companies that ignore the file. It’s simply a convention that doesn’t carry any legal or regulatory weight. Nor does it help with Google or Microsoft’s use of your website’s copyrighted content unless you’re also willing to remove your site from the biggest search engines.
Although I’m glad there is a way to block at least some AI web crawlers prospectively, it’s cold comfort. We and many other sites have years of articles that have already been crawled to train these models, and you can’t unring that bell. That said, MacStories’ robots.txt file has been updated to ban Apple and OpenAI’s crawlers, and we’re investigating additional server-level protections.
If you listen to Ruminate or follow my writing on MacStories, you know that I think what these companies are doing is wrong in both the moral and legal senses of the word. However, nothing captures it quite as well as this Mastodon post by Federico today:
If you’ve ever read the principles that guide us at MacStories, I’m sure Federico’s post came as no surprise. We care deeply about the Open Web, but ‘open’ doesn’t give tech companies free rein to appropriate our work to build their products.
Yesterday, Federico linked to Apple’s Machine Learning Research website where it was disclosed that the company has indexed the web to train its model without the consent of publishers. I was as disappointed in Apple as Federico. I also immediately thought of this 2010 clip of Steve Jobs near the end of his life, reflecting on what ‘the intersection of Technology and the Liberal Arts’ meant to Apple:
I’ve always loved that clip. It speaks to me as someone who loves technology and creates things for the web. In hindsight, I also think that Jobs was explaining what he hoped his legacy would be. It’s ironic that he spoke about ‘technology married with Liberal Arts,’ which superficially sounds like what Apple and others have done to create their AI models but couldn’t be further from what he meant. It’s hard to watch that clip now and not wonder if Apple has lost sight of what guided it in 2010.