One of the fixtures of the modern web is the robots.txt file, a plain-text file that tells web-crawling robots which parts of a web site are off-limits to them, so that they avoid reindexing duplicate content or fetching bandwidth-intensive large files. A number of search engines, such as Google, honor robots.txt restrictions, though there’s no technical reason they have to.
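For readers who haven’t looked inside one, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python’s standard `urllib.robotparser` module. The file contents, the `ExampleBot` name, and the example.com URLs are all made up for illustration; this is not how any particular search engine or the Internet Archive implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site; the paths are invented.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /downloads/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = make_parser(EXAMPLE_ROBOTS_TXT)
    # "ExampleBot" is a hypothetical crawler name.
    print(fetch_allowed(parser, "ExampleBot", "https://example.com/articles/1"))         # True
    print(fetch_allowed(parser, "ExampleBot", "https://example.com/downloads/big.iso"))  # False
```

Note that the check happens entirely on the crawler’s side. Nothing on the server stops a robot that simply skips it, which is why honoring robots.txt is a courtesy rather than a technical requirement.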
Until recently, the Internet Archive also honored the instructions in robots.txt files, but that is about to change. On the Internet Archive’s announcement blog, Mark Graham explains that robots.txt’s search-indexing functionality is increasingly at odds with the site’s mission to archive the web as it was.
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
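The parked-domain situation Graham describes can come down to a two-line robots.txt that disallows everything. As a rough sketch (again with Python’s standard `urllib.robotparser`), the snippet below probes a handful of paths to see whether a file shuts a crawler out of the whole site. The `ExampleArchiveBot` name and the probe paths are hypothetical, and probing a few paths is only a heuristic, not a full analysis of the rules.

```python
from urllib.robotparser import RobotFileParser

# The kind of blanket rule a parked domain might serve (illustrative only).
PARKED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def blocks_entire_site(robots_txt: str, user_agent: str = "ExampleArchiveBot") -> bool:
    """Heuristic: True if none of a few sample paths may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Hypothetical paths standing in for a site's old content.
    probe_paths = ["/", "/about.html", "/2009/05/some-old-article/"]
    return not any(parser.can_fetch(user_agent, path) for path in probe_paths)


if __name__ == "__main__":
    print(blocks_entire_site(PARKED_ROBOTS_TXT))                         # True
    print(blocks_entire_site("User-agent: *\nDisallow: /private/\n"))    # False
```

An archive that applies such a file retroactively would hide every snapshot of the site, including those captured long before the domain was parked, which is the “disappeared” effect described in the quote.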
A few months ago, Archive.org’s crawlers stopped honoring military and government sites’ robots.txt files. After having observed the results of this trial run, the Internet Archive is now ready to stop honoring robots.txt files in general, so as to create complete archives of the web including sites that have since expired and been parked. The Archive will still respond to removal requests by email to info@archive.org.
As it happens, we at TeleRead had some personal experience with the Internet Archive’s robots.txt behavior. A number of articles from TeleRead were lost when the site was moved over to hosting on the commercial TeleRead.com domain, including my “Paleo E-Books” series, which I was rather proud of. Subsequently, the only way to link to those lost articles was through Archive.org. However, a few weeks ago, I noticed that the new host to which we had moved that commercial site served a robots.txt file that was keeping Archive.org from making those old articles available.
Fortunately, we were able to get that robots.txt file removed and the content made accessible again. But if the Internet Archive had stopped honoring robots.txt files, that wouldn’t even have been an issue.
From an archival standpoint, disregarding robots.txt makes a lot of sense. On the other hand, robots.txt files are sometimes meant to keep confidential information, such as social security numbers and other personally identifying information, from being exposed by search engines. What if the Internet Archive exposes that information instead? Of course, the damage that could be done is limited, because there’s no search engine for Archive.org content as yet; an archived page can only be found by looking up the URL of the original page.
Another concern is the bandwidth charges that could be run up if Archive.org starts downloading huge files from web sites every time it crawls them, let alone if other crawlers also stop honoring robots.txt instructions.
Nonetheless, the matter of robots.txt files removing the contents of public web sites from the public record is important. What if every time a newspaper or magazine shut down, public libraries had to remove its back issues from circulation? In that respect, disregarding robots.txt files could be a good idea. It remains to be seen whether that will bring about any problems.
(Found via BoingBoing.)
Important post, Chris. So glad that the Archive is taking an aggressive stand on the robots.txt issue, especially with the risk of the Trumpsters zapping important data from federal sites. But for now there are still some problems in regard to access to the old TeleRead.com via the Wayback Machine. I’ve asked the Archive to do another crawl, so we can get rid of the past robots.txt problem. Meanwhile thanks to Nate Hoffelder and Reclaim Hosting for HTMLization of the TeleRead.com site. I’ll be doing a future post on the issue of preserving TeleRead. But first, I want to make sure the Wayback Machine issue is addressed. That almost surely will require action at the Archive’s end rather than ours.
The robots.txt has been ignored for years and the web hasn’t collapsed yet. Malware checks use real web browsers driven by script, running in virtual machines, to look for bad stuff. And to look for sites running fraudulent advertising. Which is funny because the act of browsing the site inflates their ad numbers…
I can understand the use cases presented, but there is such a thing as copyright and site use policies. When the goal of archive.org conflicts with those laws, expect copyright to win.
They best honor mine. A) How dare you hog what I pay for! B) My website best not be in your archives because IT IS MINE and if I want it no longer on the web — get outta dodge! C) Google does that and your entire web contents and then have advertisers pay to be listed in the cached portion! Copyright infringement! D) Just because you are archive.org that gives you no right to ASSUME that the robot.txt is strictly for search engines!
How dare you Internet Archive!
Yeah clients hiring freelance developers don’t want their files private until the final product, why on earth would they want that! /sarcasm
It’s the internet archive’s fault the robots.txt was retroactive in the first stupid place. Yes ‘.gov’ sites should have their robots.txt ignored for archive purposes, I have no problem with that. Government sites are public in nature, but the work I do as a freelance web developer while being paid to keep it private is NOT.