Oh, how I wish the Wayback Machine would ignore robots.txt... So many websites lost to history because some rookie webmaster put misguided commands into the file without thinking about the consequences (e.g. blocking all crawlers except Google).
The worst part is when a site is stored in the Wayback Machine, then the domain expires and the new owner (or squatter) puts up a robots.txt that blocks everything, and all the old content becomes inaccessible.
They should store a history of WHOIS data for the site's domain and keep separate archives for when the owner changes, I think. Also, why did anyone think that applying robots.txt retroactively was a good idea? :/
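Something like the separation being suggested might look like this rough sketch (Python; the WHOIS record shape, names, and dates are invented for illustration, and real WHOIS history is far messier):

    from bisect import bisect_right
    from datetime import datetime

    # (owner, ownership_start) pairs for one domain, sorted by start date -- invented data
    ownership_history = [
        ("Original Owner LLC", datetime(2001, 3, 1)),
        ("Domain Squatter Inc", datetime(2012, 7, 15)),
    ]

    def owner_at(when: datetime) -> str:
        # Find the most recent ownership change at or before `when`.
        starts = [start for _, start in ownership_history]
        i = bisect_right(starts, when) - 1
        return ownership_history[i][0] if i >= 0 else "unknown"

    def archive_key(domain: str, captured_at: datetime):
        # Snapshots captured under different owners go into different archives,
        # so a later owner's robots.txt can't hide an earlier owner's pages.
        return (domain, owner_at(captured_at))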
The worst part of this is that it's retroactive, so adding a robots.txt that denies the Wayback Machine access causes it to delete all history of the site. This is really annoying for patent cases where the prior art is on the applicant's own website: they can go and remove the prior art so it's no longer available (which is why examiners make copies of the Wayback content before making their reports).
To be pedantic, they aren't lost. They are just unavailable until the robots.txt goes away. I'm fairly sure the Internet Archive aren't too keen on deleting things (unless you absolutely, super-duperly want it gone and you're the author/owner of the data).
I'm surprised that some upstart search engine hasn't made it a selling point that they ignore robots.txt, claiming they search the pages Google doesn't, or something.
Speaking as an upstart search engine guy (blekko) who also has a bunch of webpages and a huge robots.txt, that's a bad idea. Such a crawler would be knocking down webservers by running expensive scripts and clicking links that do bad things like deleting records from databases or reverting edits in wikis. You don't want to go there.
Really? I was always taught that search engines only make GET requests, and that anything which modifies data belongs in a POST request. Are there really that many broken websites out there that haven't already fallen victim to crawlers that ignore robots.txt?
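For what it's worth, here's a minimal sketch of the convention being described, using Python's standard-library urllib.robotparser and urllib.request; the crawler name and URLs are made up, and a real crawler is far more involved:

    from urllib.robotparser import RobotFileParser
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleArchiveBot/0.1"  # hypothetical crawler name

    def fetch_if_allowed(page_url, robots_url):
        # Consult robots.txt first; if we're disallowed, don't even request the page.
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()  # fetches robots.txt with a plain GET
        if not rp.can_fetch(USER_AGENT, page_url):
            return None
        # Only ever GET: the assumption is that a GET is "safe" and changes nothing.
        req = Request(page_url, headers={"User-Agent": USER_AGENT}, method="GET")
        with urlopen(req) as resp:
            return resp.read()

    # e.g. fetch_if_allowed("https://example.com/some/page",
    #                       "https://example.com/robots.txt")

The problem the parent describes is exactly that the GET-is-safe assumption doesn't hold on badly built sites, where a plain GET link can delete a record or revert a wiki edit.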
I noticed this today. Googling "united check in" and clicking the "check" link took me to a page that told me the confirmation number I entered was invalid, though I never entered one.
IANAL. But although in principle providing an easy opt-out shouldn't really matter with respect to copyright and so forth, as a practical matter it seems as if it does--in that, if you care even vaguely about your website not being mirrored, you have an easy way to prevent it. An organization like the Internet Archive simply can't afford (in terms of either time or money) to take a more aggressive approach to mirroring.
To be more specific--short of granting the Internet Archive some sort of special library exemption--what if I were to, say, create a special archive of popular cartoon strips? What's the distinction?
[EDIT: The retroactive robots.txt situation seems less clear but, like orphan works, also depends on the scenarios you care to devise.]
Or, with a more historical lens: lots of history has been learned by poring over the intimate private correspondence of historical figures, most of whom, I would imagine, would feel quite perturbed to see their love letters on display in museums.
Should historians not read private letters sent long ago? Should they swear to some oath and take a moral stand that such things shouldn't be examined?
If the answer is "No, they should read them," then, in that same way, why should we observe robots.txt when it comes to the historical record? Isn't it the same thing?
A Disallow in robots.txt means that you should not crawl the site today. It should have no effect whatsoever on displaying pages that WERE crawled before the timestamp on the robots.txt file.
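A rough sketch of the non-retroactive policy being argued for here (this is not how the Wayback Machine actually works; the function and field names are hypothetical):

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Snapshot:              # hypothetical record of one archived capture
        url: str
        captured_at: datetime

    def may_crawl(disallowed_today: bool) -> bool:
        # Prospective rule: a Disallow stops new crawls from now on.
        return not disallowed_today

    def may_display(snap: Snapshot, robots_seen_at: datetime,
                    disallowed_today: bool) -> bool:
        # Non-retroactive rule: snapshots captured before the current robots.txt
        # appeared remain visible even if crawling is disallowed today.
        if not disallowed_today:
            return True
        return snap.captured_at < robots_seen_at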
I don't think there is an inarguable answer to my rhetorical question. People's intents and wishes do matter.
But there also is an idea from antiquity about the public good and the commons. I guess at some point my personal wishes get trumped by this overarching principle.
The whole point of the question was that someone would say "You may not read my love letters" and then society said "Too bad, we're doing it anyway. And reprinting them in high school textbooks."
Is that ok? I don't think there's a clear line and I do think there are probably moral boundaries.
I'm by no means Lawrence Lessig, and this is a type of discourse I'm really not experienced at. I do think there are many important questions here that we may need to rethink.
One might nitpick that there was initially some distinction between the publicly available internet and a private Facebook, although the latter seems to be making strides to narrow that gap.
Yes, because secrets and forgetting can be important.
It's not our cultural tradition that every written work (train schedules, greeting cards, friendly notes, lolcats, etc.) must be archived at the Library of Congress. I'm not sure that it'd be a good idea.
No one is stopping you from archiving my websites if you think the data will have some importance. It seems like you're suggesting that archive.org is the universal keeper of history and everyone should agree with that idea.
I'd love to see this, even if they kept the content private for X number of years. Copyright runs out eventually, and the content would still be archived by then.