jducoeur: (Default)
[personal profile] jducoeur
During today's massive update of the Period Games Homepage, I'm discovering a new horror. Many of the sites I point to are now dead, which isn't a surprise. Many of them have been taken over by domain thieves, which also isn't a surprise.

What *is* a surprise is that many of those thieves have turned on robots.txt files that wind up blocking the Wayback Machine from producing results: it appears that archive.org respects robots.txt a little *too* much. The result is that a large number of useful pages are just plain inaccessible -- I can't even get at their archived versions. Grr...

(BTW, time for another reminder that archive.org is one of the most important and unsung sites on the Web -- the Wayback Machine is the only really good archive of the Web's history, and is often invaluable. I've given them another donation today...)

(no subject)

Date: 2010-02-22 01:20 am (UTC)
From: [personal profile] cellio
Argh, slimy!

(no subject)

Date: 2010-02-22 02:11 pm (UTC)
From: [identity profile] metageek.livejournal.com
"it appears that archive.org respects robots.txt a little *too* much"

There's a little about that in Wikipedia.

(no subject)

Date: 2010-02-22 02:39 pm (UTC)
From: [identity profile] metageek.livejournal.com
They might be able to do it by looking at the domain's whois data. Hard to say just what algorithm they could use, though; the simple ones I've thought of all have edge cases where they'd incorrectly believe the domain had changed hands. For example, if the domain lapses and gets hijacked, the domain creation date will reset, so don't apply robots.txt to anything before the creation date. But the same reset might happen if the domain lapses and then the original owner re-registers it.
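
A rough sketch of that creation-date heuristic, in Python. This is only an illustration of the idea in the comment above, not anything archive.org is known to do. The function names, the sample dates, and the assumption that the whois creation date is already available as a `date` are all hypothetical.

```python
from datetime import date


def robots_applies(snapshot_date: date, domain_created: date) -> bool:
    """Return True if the current robots.txt should govern this snapshot.

    Heuristic: a snapshot taken before the domain's current whois
    creation date predates the current registration, so the current
    robots.txt is not applied to it.
    """
    return snapshot_date >= domain_created


def visible_snapshots(snapshots, domain_created, robots_blocks):
    """Keep every snapshot except those the current robots.txt governs and blocks."""
    return [
        d for d in snapshots
        if not (robots_blocks and robots_applies(d, domain_created))
    ]


# Hypothetical data: the domain lapsed and was registered again in 2008.
snapshots = [date(2003, 5, 1), date(2006, 8, 15), date(2009, 11, 30)]
created = date(2008, 1, 10)

# Only the snapshots from before the new registration stay visible.
print(visible_snapshots(snapshots, created, robots_blocks=True))
```

The edge case is visible here: the code sees only the creation date, so an original owner who let the domain lapse, registered it again, and then added a blocking robots.txt would have that file ignored for their own older pages.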
