One of the fixtures of the modern web is the robots.txt file, a file that tells web-crawling robots which parts of a web site are off-limits to them, typically so that they avoid reindexing duplicate content or fetching bandwidth-intensive large files. A number of search engines, such as Google, honor robots.txt restrictions, though compliance is voluntary and nothing technically forces them to.
Until now, the Internet Archive has also honored the instructions in robots.txt files, but that is about to change. On the Internet Archive’s announcement blog, Mark Graham explains that robots.txt’s search-indexing purpose is increasingly at odds with the site’s mission to archive the web as it was:
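To make the voluntary nature of robots.txt concrete, here is a minimal sketch of how a well-behaved crawler might check the file before fetching a page, using Python's standard urllib.robotparser module. The rules, the example.com URLs, the "examplebot" user agent, and the helper name may_fetch are all made up for illustration; nothing here reflects how Google or the Internet Archive actually implement their crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: keep crawlers away from
# large downloads and from duplicate "print" versions of pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
Disallow: /print/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/articles/paleo-e-books",
        "https://example.com/downloads/huge-video.mp4",
    ):
        print(url, may_fetch(ROBOTS_TXT, "examplebot", url))
```

The check is purely advisory: the crawler has to choose to call it and to respect the answer. A crawler that skips the call can still request the "disallowed" URL, and the server will serve it like any other page.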
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
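The parked-domain problem described in that passage can be sketched in a few lines. The code below assumes, purely for illustration, that an archive decides whether to show an old snapshot by checking the URL against the robots.txt the domain serves today. The function name snapshot_visible, the example.com URLs, and the choice of "ia_archiver" as the user agent are assumptions for this sketch, not a description of the Wayback Machine's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served by a parked domain: block every crawler
# from every path.
PARKED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def snapshot_visible(current_robots_txt, snapshot_url, user_agent="ia_archiver"):
    """Assumed policy: show an old snapshot only if today's robots.txt allows its URL."""
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(user_agent, snapshot_url)


# Pages that were captured years ago, while the site was still live.
old_pages = [
    "http://example.com/",
    "http://example.com/about.html",
    "http://example.com/articles/2009/launch.html",
]

hidden = [url for url in old_pages if not snapshot_visible(PARKED_ROBOTS_TXT, url)]
print(len(hidden), "of", len(old_pages), "old pages hidden")
```

Under that assumed policy, a single "Disallow: /" added after the site has died hides every page ever captured, even though none of those pages carried any restriction when they were archived.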
A few months ago, Archive.org’s crawlers stopped honoring robots.txt files on military and government sites. Having observed the results of this trial run, the Internet Archive is now ready to stop honoring robots.txt files in general, so as to create complete archives of the web, including sites that have since expired and been parked. The Archive will still respond to removal requests sent by email.
As it happens, we at TeleRead had some personal experience with the Internet Archive’s robots.txt handling. A number of articles from TeleRead were lost when the site moved to the commercial TeleRead.com domain, including my “Paleo E-Books” series, which I was rather proud of. After that, the only way to link to those lost articles was through Archive.org. However, a few weeks ago, I noticed that the new host we had moved that commercial site to had a robots.txt file that was keeping Archive.org from making those old articles available.
Fortunately, we were able to get that robots.txt file removed and the content made accessible again. But if the Internet Archive had already stopped honoring robots.txt files, that wouldn’t have been an issue in the first place.
From an archival standpoint, disregarding robots.txt makes a lot of sense. On the other hand, sometimes robots.txt files are meant to prevent confidential information (such as social security numbers and other personally identifying information) from being exposed by search engines. What if the Internet Archive exposes that information instead? Of course, the amount of damage that could be done is limited, because there’s no search engine for Archive.org content as yet; it can only be accessed by looking up the URL of the original page.
Another concern is the bandwidth charges that site owners could run up if archive.org starts downloading huge files from web sites every time it crawls them, let alone if other crawlers also stop honoring robots.txt instructions.
Nonetheless, the matter of robots.txt files removing the contents of public web sites from the public record is important. What if every time a newspaper or magazine shut down, public libraries had to remove its back issues from circulation? In that respect, disregarding robots.txt files could be a good idea. It remains to be seen whether that will bring about any problems.
(Found via BoingBoing.)