Oh, how I wish the Wayback Machine would ignore robots.txt... So many websites lost to history because some rookie webmaster put misguided commands into the file without thinking about the consequences (e.g. blocking all crawlers except Google).
The worst part is when a site is stored in the Wayback Machine, then the domain expires and the new owner (or squatter) puts up a robots.txt that blocks everything, and all the old content becomes inaccessible.
They should store a history of WHOIS data for the site's domain and keep separate archives for when the owner changes, I think. Also, why did anyone think that applying robots.txt retroactively was a good idea? :/
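Something like the separation being suggested might look like this rough sketch (Python; the WHOIS record shape, names, and dates are invented for illustration, and real WHOIS history is far messier):

    from bisect import bisect_right
    from datetime import datetime

    # (owner, ownership_start) pairs for one domain, sorted by start date -- invented data
    ownership_history = [
        ("Original Owner LLC", datetime(2001, 3, 1)),
        ("Domain Squatter Inc", datetime(2012, 7, 15)),
    ]

    def owner_at(when: datetime) -> str:
        # Find the most recent ownership change at or before `when`.
        starts = [start for _, start in ownership_history]
        i = bisect_right(starts, when) - 1
        return ownership_history[i][0] if i >= 0 else "unknown"

    def archive_key(domain: str, captured_at: datetime):
        # Snapshots captured under different owners go into different archives,
        # so a later owner's robots.txt can't hide an earlier owner's pages.
        return (domain, owner_at(captured_at))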
The worst part of this is that it's retroactive, so adding a robots.txt that denies the Wayback Machine access causes it to delete all history of the site. This is really annoying for patent cases where the prior art is on the applicant's own website: they can go and remove the prior art so it's no longer available (which is why examiners make copies of the Wayback content before making their reports).
To be pedantic, they aren't lost. They are just unavailable until the robots.txt goes away. I'm fairly sure the Internet Archive aren't too keen on deleting things (unless you absolutely, super-duperly want it gone and you're the author/owner of the data).
I'm surprised that some upstart search engine hasn't made it a selling point that they ignore robots.txt, claiming they search the pages Google doesn't, or something.
Speaking as an upstart search engine guy (blekko) who also has a bunch of webpages and a huge robots.txt, that's a bad idea. Such a crawler would be knocking down webservers by running expensive scripts and clicking links that do bad things like deleting records from databases or reverting edits in wikis. You don't want to go there.
Really? I was always taught that search engines only make GET requests, and that anything which modifies data belongs in a POST request. Are there really that many broken websites out there that haven't already fallen victim to crawlers that ignore robots.txt?
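For what it's worth, here's a minimal sketch of the convention being described, using Python's standard-library urllib.robotparser and urllib.request; the crawler name and URLs are made up, and a real crawler is far more involved:

    from urllib.robotparser import RobotFileParser
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleArchiveBot/0.1"  # hypothetical crawler name

    def fetch_if_allowed(page_url, robots_url):
        # Consult robots.txt first; if we're disallowed, don't even request the page.
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()  # fetches robots.txt with a plain GET
        if not rp.can_fetch(USER_AGENT, page_url):
            return None
        # Only ever GET: the assumption is that a GET is "safe" and changes nothing.
        req = Request(page_url, headers={"User-Agent": USER_AGENT}, method="GET")
        with urlopen(req) as resp:
            return resp.read()

    # e.g. fetch_if_allowed("https://example.com/some/page",
    #                       "https://example.com/robots.txt")

The problem the parent describes is exactly that the GET-is-safe assumption doesn't hold on badly built sites, where a plain GET link can delete a record or revert a wiki edit.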
I noticed this today. Googling "united check in" and clicking the "check" link took me to a page that told me the confirmation number I entered was invalid, though I never entered one.
IANAL. But although in principle providing an easy opt-out shouldn't really matter with respect to copyright and so forth, as a practical matter it seems as if it does--in that, if you care even vaguely about your website not being mirrored, you have an easy way to prevent it. An organization like the Internet Archive simply can't afford (in terms of either time or money) to take a more aggressive approach to mirroring.
To be more specific--short of granting the Internet Archive some sort of special library exemption--what if I were to, say, create a special archive of popular cartoon strips? What's the distinction?
[EDIT: The retroactive robots.txt situation seems less clear but, like orphan works, also depends on the scenarios you care to devise.]
Or, with a more historical lens: lots of history has been learned by poring over the intimate private correspondence of historical figures, most of whom, I would imagine, would feel quite perturbed to see their love letters on display in museums.
Should historians not read private letters sent long ago? Should they swear to some oath and take a moral stand that such things shouldn't be examined?
If the answer is "No, they should read them," then, in that same way, why should we observe robots.txt when it comes to the historical record? Isn't it the same thing?
A Disallow in robots.txt means that you should not crawl the site today. It should have no effect whatsoever on displaying pages that WERE crawled before the timestamp on the robots.txt file.
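A rough sketch of the non-retroactive policy being argued for here (this is not how the Wayback Machine actually works; the function and field names are hypothetical):

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Snapshot:              # hypothetical record of one archived capture
        url: str
        captured_at: datetime

    def may_crawl(disallowed_today: bool) -> bool:
        # Prospective rule: a Disallow stops new crawls from now on.
        return not disallowed_today

    def may_display(snap: Snapshot, robots_seen_at: datetime,
                    disallowed_today: bool) -> bool:
        # Non-retroactive rule: snapshots captured before the current robots.txt
        # appeared remain visible even if crawling is disallowed today.
        if not disallowed_today:
            return True
        return snap.captured_at < robots_seen_at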
I don't think there is an inarguable answer to my rhetorical question. People's intents and wishes do matter.
But there also is an idea from antiquity about the public good and the commons. I guess at some point my personal wishes get trumped by this overarching principle.
The whole point of the question was that someone would say "You may not read my love letters" and then society said "Too bad, we're doing it anyway. And reprinting them in high school textbooks."
Is that ok? I don't think there's a clear line and I do think there are probably moral boundaries.
I'm by no means Lawrence Lessig, and this is a type of discourse I'm really not experienced at. I do think there are many important questions here that we may need to rethink.
One might nitpick that there was initially some distinction between the publicly available internet and a private Facebook, although the latter seems to be making strides to narrow that gap.
Yes, because secrets and forgetting can be important.
It's not our cultural tradition that every written work (train schedules, greeting cards, friendly notes, lolcats, etc.) must be archived at the Library of Congress. I'm not sure that it'd be a good idea.
No one is stopping you from archiving my websites if you think the data will have some importance. It seems like you're suggesting that archive.org is the universal keeper of history and everyone should agree with that idea.
I'd love to see this, even if they kept the content private for X number of years. Copyright runs out eventually, and the content would still be archived by then.