Posts tagged with "Apple Intelligence"

Opting Out of AI Model Training

Dan Moren has an excellent guide on Six Colors that explains how to exclude your website from the web crawlers used by Apple, OpenAI, and others to train large language models for their AI products. For many sites, the process simply requires a few edits to the robots.txt file on your server:

If you’re not familiar with robots.txt, it’s a text file placed at the root of a web server that can give instructions about how automated web crawlers are allowed to interact with your site. This system enables publishers to not only entirely block their sites from crawlers, but also specify just parts of the sites to allow or disallow.
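To make that concrete, here is a rough sketch of the kind of robots.txt entries a guide like Dan's walks through. The user-agent tokens are assumptions based on the crawler documentation Apple and OpenAI publish (Applebot-Extended for Apple's AI training, GPTBot for OpenAI); check each company's docs for the current names before copying anything.

```
# Sketch of a robots.txt that opts out of AI-training crawlers while
# leaving ordinary search crawling alone. The user-agent tokens below
# are assumptions based on each company's published crawler docs.

User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

# Everyone else keeps normal access to the whole site.
User-agent: *
Disallow:
```

As Apple documents it, the split between Applebot (the search crawler) and Applebot-Extended (the token that controls whether crawled data is used for model training) is what lets a site stay in Siri and Spotlight results while still opting out of training.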

The process is a little more complicated with something like WordPress, which MacStories uses, and Dan covers that too.

Unfortunately, as Dan explains, editing robots.txt isn’t a solution for companies that ignore the file. It’s simply a convention that doesn’t carry any legal or regulatory weight. Nor does it help with Google or Microsoft’s use of your website’s copyrighted content unless you’re also willing to remove your site from the biggest search engines.

Although I’m glad there is a way to block at least some AI web crawlers prospectively, it’s cold comfort. We, like many other sites, have years of articles that have already been crawled to train these models, and you can’t unring that bell. That said, MacStories’ robots.txt file has been updated to ban Apple and OpenAI’s crawlers, and we’re investigating additional server-level protections.

If you listen to Ruminate or follow my writing on MacStories, you know that I think what these companies are doing is wrong in both the moral and the legal sense of the word. However, nothing captures it quite as well as this Mastodon post by Federico today:

If you’ve ever read the principles that guide us at MacStories, I’m sure Federico’s post came as no surprise. We care deeply about the Open Web, but ‘open’ doesn’t give tech companies free rein to appropriate our work to build their products.

Yesterday, Federico linked to Apple’s Machine Learning Research website where it was disclosed that the company has indexed the web to train its model without the consent of publishers. I was as disappointed in Apple as Federico. I also immediately thought of this 2010 clip of Steve Jobs near the end of his life, reflecting on what ‘the intersection of Technology and the Liberal Arts’ meant to Apple:

I’ve always loved that clip. It speaks to me as someone who loves technology and creates things for the web. In hindsight, I also think that Jobs was explaining what he hoped his legacy would be. It’s ironic that he spoke about ‘technology married with Liberal Arts,’ which superficially sounds like what Apple and others have done to create their AI models but couldn’t be further from what he meant. It’s hard to watch that clip now and not wonder if Apple has lost sight of what guided it in 2010.


You can follow all of our WWDC coverage through our WWDC 2024 hub or subscribe to the dedicated WWDC 2024 RSS feed.


Apple Details Its AI Foundation Models and Applebot Web Scraping

From Apple’s Machine Learning Research1 blog:

Our foundation models are trained on Apple’s AXLearn framework, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.
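If the parallelism jargon in that paragraph is unfamiliar, here is a minimal JAX sketch of the two ideas that matter most: splitting the batch across devices (data parallelism) and sharding the parameters themselves across devices instead of replicating them (the FSDP idea). This is not AXLearn code; the model, shapes, and mesh layout are invented.

```python
# Toy illustration of data parallelism + FSDP-style parameter sharding in JAX.
# Not AXLearn code: the model, shapes, and mesh layout are invented, and real
# training adds tensor/sequence parallelism as extra mesh axes.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Put every local device on a single mesh axis called "data".
mesh = Mesh(np.array(jax.devices()), ("data",))

# A one-layer "model" and a toy batch (assumes sizes divide evenly across devices).
params = {"w": jnp.zeros((4096, 4096)), "b": jnp.zeros((4096,))}
batch = jnp.ones((32, 4096))

# Data parallelism: each device gets a slice of the batch along the leading axis.
batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

# FSDP-style sharding: each device holds only a shard of every parameter
# instead of a full replica; the compiler gathers shards when they're needed.
params = jax.device_put(params, {
    "w": NamedSharding(mesh, P("data", None)),
    "b": NamedSharding(mesh, P("data")),
})

def loss_fn(params, batch):
    out = batch @ params["w"] + params["b"]
    return jnp.mean(out ** 2)

# jit plus input shardings lets XLA insert the collectives (all-gather,
# reduce-scatter) that frameworks like AXLearn orchestrate at far larger scale.
grads = jax.jit(jax.grad(loss_fn))(params, batch)
print(jax.tree_util.tree_map(lambda g: g.sharding, grads))
```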

We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control.

We never use our users’ private personal data or user interactions when training our foundation models, and we apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet. We also filter profanity and other low-quality content to prevent its inclusion in the training corpus. In addition to filtering, we perform data extraction, deduplication, and the application of a model-based classifier to identify high quality documents.
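For a sense of what those cleaning steps involve, here is a deliberately simplified Python sketch. It is not Apple's pipeline; the regexes and the quality check are crude stand-ins for the filters and model-based classifiers the post describes.

```python
# Generic sketch of a pre-training data-cleaning pass: PII redaction,
# exact deduplication, and a stand-in for a model-based quality filter.
# Illustrative only; not Apple's actual pipeline.
import hashlib
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")       # US social security numbers
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")     # naive credit-card-like digit runs

def redact_pii(text: str) -> str:
    """Replace obvious PII patterns with placeholder tokens."""
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)

def looks_high_quality(text: str) -> bool:
    """Stand-in for a model-based classifier: here, just a crude length check."""
    return len(text.split()) >= 20

def clean_corpus(documents):
    """Yield redacted, deduplicated, quality-filtered documents."""
    seen_hashes = set()
    for doc in documents:
        doc = redact_pii(doc)
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:   # exact duplicate: skip
            continue
        seen_hashes.add(digest)
        if looks_high_quality(doc):
            yield doc
```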

It’s a very technical read, but it shows how Apple approached building AI features in their products and how their on-device and server models compare to others in the industry (on servers, Apple claims their model is essentially neck and neck with GPT-4-Turbo, OpenAI’s older model).

This blog post, however, pretty much parallels my reaction to the WWDC keynote. Everything was fun and cool until they showed generative image creation that spits out slop “resembling” (strong word) other people; and in this post, everything was cool until they mentioned how – surprise! – Applebot had already indexed web content to train their model without the consent of publishers, who can only opt out now. (This was also confirmed by Apple executives elsewhere.)

As a creator and website owner, I guess that these things will never sit right with me. Why should we accept that certain data sets require a licensing fee but anything that is found “on the open web” can be mindlessly scraped, parsed, and regurgitated by an AI? Web publishers (and especially indie web publishers these days, who cannot afford lawsuits or hiring law firms to strike expensive deals) deserve better.

It’s disappointing to see Apple muddy an otherwise compelling set of features (some of which I really want to try) with practices that are no better than the rest of the industry.


  1. How long until this becomes the ‘Apple Intelligence Research’ website? ↩︎

Apple Intelligence: The MacStories Overview

After months of anticipation and speculation about what Apple could be doing in the world of artificial intelligence, we now have our first glimpse at the company’s approach: Apple Intelligence. Based on generative models, Apple Intelligence uses a combination of on-device and cloud processing to offer intelligence features that are personalized, useful, and secure. In today’s WWDC keynote, Tim Cook went so far as to call it “the next big step for Apple.”

From the company’s press release on Apple Intelligence:

“We’re thrilled to introduce a new chapter in Apple innovation. Apple Intelligence will transform what users can do with our products — and what our products can do for our users,” said Tim Cook, Apple’s CEO. “Our unique approach combines generative AI with a user’s personal context to deliver truly helpful intelligence. And it can access that information in a completely private and secure way to help users do the things that matter most to them. This is AI as only Apple can deliver it, and we can’t wait for users to experience what it can do.”

It’s clear from today’s presentation that Apple is positioning itself as taking a different approach to AI than the rest of the industry. The company is putting generative models at the core of its devices while seeking to stay true to its principles. And that starts with privacy.

Read more