Posts tagged with "web"

Websites Increasingly Tell Apple and AI Companies to Stop Scraping

Wired reported today that many large websites are blocking Applebot-Extended, Apple’s artificial intelligence (AI) web crawler. Wired determined this by examining the sites’ public robots.txt files, which Apple says it respects but some AI companies don’t. According to its research:

WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training. The cold reception reflects a significant shift in both the perception and use of the robotic crawlers that have trawled the web for decades. Now that these bots play a key role in collecting AI training data, they’ve become a conflict zone over intellectual property and the future of the web.

With the release of Apple Intelligence around the corner, I suppose it makes sense to single out Apple here, but this is not really news. A study in July that Kevin Roose wrote about for The New York Times concluded that websites are blocking web crawlers from all AI companies at a dramatic rate:

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

These numbers don’t seem to account for websites that block crawlers with server-side methods or with Cloudflare’s bot-blocking tool, which could mean the decline in available data is underreported.
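For context, here is a minimal sketch of the kind of robots.txt restriction the study counted, checked with Python's standard-library parser. The robots.txt contents, the `is_allowed` helper, and the example.com URL are all made up for illustration; they are not any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block two AI-related user agents, allow everyone else.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"  # placeholder URL
    for agent in ("Applebot-Extended", "GPTBot", "Applebot"):
        print(agent, "allowed" if is_allowed(agent, page) else "blocked")
```

Keep in mind that this only models what a well-behaved crawler would decide for itself. Nothing in robots.txt stops a bot that chooses to ignore the file.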

Still, it’s interesting to see more and more websites evaluate the tradeoffs of allowing AI crawlers to scrape their sites and decide they’re not worth it. However, I wouldn’t be surprised if the media companies that cut deals with OpenAI and others are contractually obligated to block competing crawlers.

I’d also point out that it’s disingenuous of Apple to tell Wired that Applebot-Extended is a way to respect publishers’ rights when the company didn’t offer publishers the chance to opt out until after it had scraped the entire web. However, therein lies the explanation of why so many sites have blocked Applebot-Extended since WWDC, I suppose.

What’s unclear is how this all shakes out. Big media companies are hedging their bets by making deals with the likes of OpenAI and Perplexity in case Google search continues its decline and is replaced by chatbots. Whether those are good bets or not remains to be seen, but at least they offer some short-term cash flow and referral traffic in what has been a prolonged drought for the media industry.

For websites that don’t make deals or are too small for AI companies to care about, I can see a scenario where some play along anyway, allowing their sites to be scraped for little or no upside. For those sites that choose to stay outside the AI silos, it’s easy to paint a bleak picture, but the Internet is resilient, and I have a feeling that the Open Web will find a path forward in the end.

Apple Maps Launches in Beta on the Web

Today, Apple launched Apple Maps on the web in a surprise announcement. This beta version of Apple Maps on the web is accessible via the URL beta.maps.apple.com, and is said by the company to be compatible with Google Chrome, Safari, and Microsoft Edge on Windows. Additionally, developers will now be able to link out to Apple Maps on the web using MapKit JS.

Apple Maps on the web seems to be rather limited so far. The web app supports panning and zooming on the map, searching and tapping on locations, looking up directions, and browsing curated guides. However, it isn’t currently possible to tilt the map to view 3D building models or terrain elevation, and directions are limited to Driving and Walking. Look Around (Apple’s equivalent to Google Street View) is not available on the web either, but Apple says the feature will arrive in the coming months.

The web UI itself is reminiscent of Apple Maps on macOS and iPadOS. Recent locations can be found in a sidebar, and buttons to navigate the map are located in the top-right and bottom-right-hand corners of the page.

Just like on macOS and iPadOS, location details open in a collapsible sidebar.

Curated guides and satellite imagery are also supported in Apple Maps on the web.

Directions are limited to Driving and Walking.

In my testing, performance across Apple Maps on the web isn’t stellar in Safari. I’m observing stutters in transition animations, as well as when panning the map. In Google Chrome, however, the web app feels significantly smoother. If you attempt to access Apple Maps from Firefox, the app will not load and instead redirects you to Apple’s (short) list of supported browsers. The same message is displayed if you access the URL from Safari on iOS.

Firefox isn’t supported yet.

Apple Maps on the web is a welcome addition. Google Maps has always been available on the web for all to use, and I’m glad to finally see Apple try to compete beyond its native apps on iOS, iPadOS, and macOS. Hopefully more languages and features are coming to the web version soon.


Wired Confirms Perplexity Is Bypassing Efforts by Websites to Block Its Web Crawler

Last week, Federico and I asked Robb Knight to do what he could to block web crawlers deployed by artificial intelligence companies from scraping MacStories. Robb had already updated his own site’s robots.txt file months ago, so that’s the first thing he did for MacStories.

However, robots.txt only works if a company’s web crawler is set up to respect the file. As I wrote earlier this week, a better solution is to block them on your server, which Robb did on his personal site and wrote about late last week. The setup sends a 403 error if one of the bots listed in his server code requests information from his site.
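To make the idea concrete, here is a minimal sketch of server-side blocking by user agent, written as a small Python WSGI app. This is not Robb's code: the `BLOCKED_AGENTS` list, the `is_blocked` helper, and the `app` function are assumptions for illustration, and a real setup would more likely live in the web server or framework configuration.

```python
# Illustrative list of user-agent tokens to refuse (lowercase); an assumption,
# not the list used on any particular site.
BLOCKED_AGENTS = ("gptbot", "perplexitybot", "ccbot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENTS)


def app(environ, start_response):
    """WSGI app: answer 403 to listed bots, 200 to everyone else."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if is_blocked(environ.get("HTTP_USER_AGENT", "")):
        start_response("403 Forbidden", headers)
        return [b"Forbidden\n"]
    start_response("200 OK", headers)
    return [b"Hello, reader\n"]
```

To try it locally, the app can be served with `wsgiref.simple_server.make_server` from the standard library. Note that this approach trusts the User-Agent header, so it only stops bots that identify themselves honestly.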

Spoiler: Robb hit the nail on the head the first time.

After reading Robb’s post, Federico and I asked him to do the same for MacStories, which he did last Saturday. Once it was set up, Federico began testing the setup. OpenAI returned an error as expected, but Perplexity’s bot was still able to reach MacStories, which shouldn’t have been the case.¹

Yes, I took a screenshot of Perplexity’s API documentation because I bet it changes based on what we discovered.

That began a deep dive to try to figure out what was going on. Robb’s code checked out, blocking the user agent specified in Perplexity’s own API documentation. What we discovered after more testing was that Perplexity was hitting MacStories’ server without using the user agent it said it used, effectively doing an end run around Robb’s server code.

Robb wrote up his findings on his website, which promptly shot to the top slot on Hacker News and caught the eye of Dhruv Mehrotra and Tim Marchman of Wired, who were in the midst of investigating how Perplexity works. As Mehrotra and Marchman describe it:

A WIRED analysis and one carried out by developer Robb Knight suggest that Perplexity is able to achieve this partly through apparently ignoring a widely accepted web standard known as the Robots Exclusion Protocol to surreptitiously scrape areas of websites that operators do not want accessed by bots, despite claiming that it won’t. WIRED observed a machine tied to Perplexity—more specifically, one on an Amazon server and almost certainly operated by Perplexity—doing this on wired.com and across other Condé Nast publications.

Until earlier this week, Perplexity published in its documentation a link to a list of the IP addresses its crawlers use—an apparent effort to be transparent. However, in some cases, as both WIRED and Knight were able to demonstrate, it appears to be accessing and scraping websites from which coders have attempted to block its crawler, called Perplexity Bot, using at least one unpublicized IP address. The company has since removed references to its public IP pool from its documentation.

That secret IP address—44.221.181.252—has hit properties at Condé Nast, the media company that owns WIRED, at least 822 times in the last three months. One senior engineer at Condé Nast, who asked not to be named because he wants to “stay out of it,” calls this a “massive undercount” because the company only retains a fraction of its network logs.

WIRED verified that the IP address in question is almost certainly linked to Perplexity by creating a new website and monitoring its server logs. Immediately after a WIRED reporter prompted the Perplexity chatbot to summarize the website’s content, the server logged that the IP address visited the site. This same IP address was first observed by Knight during a similar test.
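The kind of check described above, looking at which requests came from a given IP address regardless of what user agent they claimed, can be sketched in a few lines of Python. This assumes access logs in the common "combined" format with the user agent as the last quoted field; the `agents_seen_from` helper and the sample log lines are invented for illustration and are not Wired's or Robb's actual logs or tooling.

```python
import re
from collections import Counter

# Combined log format: the client IP comes first, the user agent is the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"$')


def agents_seen_from(ip: str, log_lines) -> Counter:
    """Count the user-agent strings sent by requests from the given IP."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("ip") == ip:
            counts[match.group("agent")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; only the first IP is taken from the Wired story.
    sample = [
        '44.221.181.252 - - [01/Jan/2000:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (generic browser)"',
        '203.0.113.7 - - [01/Jan/2000:00:00:01 +0000] "GET / HTTP/1.1" 403 9 "-" "PerplexityBot/1.0"',
    ]
    print(agents_seen_from("44.221.181.252", sample))
```

If the only user agents that show up for an address look like ordinary browsers, a filter keyed on the bot's published user agent will never trigger for it.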

This sort of unethical behavior is why we took the steps we did to block the use of MacStories’ websites as training data for Perplexity and other companies.² Incidents like this and the lack of transparency about how AI companies train their models have led to a lot of mistrust in the entire industry among creators who publish on the web. I’m glad we’ve been able to play a small part in revealing Perplexity’s egregious behavior, but more needs to be done to rein in this sort of conduct, including closer scrutiny by regulators around the world.

As a footnote to this, it’s worth noting that Wired also puts to rest the argument that websites should be okay with Perplexity’s behavior because it includes citations in its plagiarism. According to Wired’s story:

WIRED’s own records show that Perplexity sent 1,265 referrals to wired.com in May, an insignificant amount in the context of the site’s overall traffic. The article to which the most traffic was referred got 17 views.

That’s next to nothing for a site with Wired’s traffic, which Similarweb and other sites peg at over 20 million page views that same month. That’s a mere 0.006% of Wired’s May traffic. Let that sink in, and then ask yourself whether it seems like a fair trade.
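For anyone who wants to check the math, here is the calculation as a quick Python sketch. The `referral_share_percent` helper is just a name I made up, and 20 million is the floor of the Similarweb estimate, so the real share is, if anything, smaller.

```python
def referral_share_percent(referrals: int, page_views: int) -> float:
    """Return referrals as a percentage of total page views."""
    return referrals / page_views * 100


# Figures from the story: 1,265 referrals against roughly 20 million page views in May.
share = referral_share_percent(1_265, 20_000_000)
print(f"{share:.3f}%")  # prints "0.006%"
```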


  1. Meanwhile, I was digging through bins of old videogames and hardware at a Retro Gaming Festival doing ‘research’ for NPC.
  2. Mehrotra and Marchman correctly question whether Perplexity is even an AI company because it piggybacks on other companies’ LLMs and uses them in conjunction with scraped web data to provide summaries that effectively replace the source’s content. However, that doesn’t change the fact that Perplexity is surreptitiously scraping sites while simultaneously professing to respect sites’ robots.txt files. That’s the unethical bit.

How We’re Trying to Protect MacStories from AI Bots and Web Crawlers – And How You Can, Too

Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.

If you read MacStories regularly, or listen to our podcasts, you already know that Federico and I think that crawling the Open Web to train large language models is unethical. Industry-wide, AI companies have scraped the content of websites like ours, using it as the raw material for their chatbots and other commercial products without the consent or compensation of publishers and other creators.

Now that the horse is out of the barn, some of those companies are respecting publishers’ robots.txt files, while others seemingly aren’t. That doesn’t make up for the tens of thousands of articles and images that have already been scraped from MacStories. Nor is robots.txt a complete solution, so it’s just one of four approaches we’re taking to protect our work.
