Krill

Reputation: 19

Stop web.archive.org from saving the site's pages

I tried to view earlier versions of facebook.com pages on web.archive.org, and the site showed me an error saying that it cannot save the pages because of the site's robots.txt.

Can anyone tell me which statements in the robots.txt make the site inaccessible to web.archive.org? I guess it is because of the #permission statement, as mentioned here (http://facebook.com/robots.txt).

Is there any other way I can do this for my site as well?

I also don't want woorank.com or builtwith.com to analyze my site.

Note: search engine bots should still be able to crawl and index my site without problems after I add statements to robots.txt to achieve the results mentioned above.

Upvotes: 1

Views: 4899

Answers (3)

hlorand

Reputation: 1406

Since 2017, the archive.org bot no longer respects robots.txt.

I inspected what traces the bot leaves. I created a test.php page that writes the $_SERVER variable to a text file:

file_put_contents("request.txt", json_encode($_SERVER) );

These were the relevant headers:

{
  "HTTP_X_FORWARDED_FOR": "207.241.225.246",
  "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/605.1.15",
  "HTTP_VIA": "Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot), Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot), 1.1 warcprox",
...
}

You can block the bot if you find the "archive.org_bot" string in the HTTP_VIA header:

// str_contains() requires PHP 8.0 or later.
if (isset($_SERVER['HTTP_VIA']) && str_contains($_SERVER['HTTP_VIA'], "archive.org_bot")) {
    http_response_code(403); // refuse the request
    die();
}
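
For sites that are not served by PHP, the same check can be sketched in Python as a small WSGI middleware. This is only a sketch under assumptions: the names is_archive_bot and block_archive_bot are made up for illustration, and it relies on the bot continuing to send the Via header shown above, which may change.

```python
# Substring observed in the Via header of the test request above.
BLOCKED_MARKER = "archive.org_bot"


def is_archive_bot(environ: dict) -> bool:
    """Return True if the request's Via header mentions archive.org_bot."""
    # WSGI exposes the Via request header as HTTP_VIA, like PHP's $_SERVER.
    return BLOCKED_MARKER in environ.get("HTTP_VIA", "")


def block_archive_bot(app):
    """Wrap a WSGI app so that matching requests receive a 403 response."""
    def middleware(environ, start_response):
        if is_archive_bot(environ):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Every other request is passed through unchanged.
        return app(environ, start_response)
    return middleware
```

Wrapping an existing application would look like `application = block_archive_bot(application)`. Requests without a Via header are treated as ordinary traffic.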

Upvotes: 3

Artur INTECH

Reputation: 7286

The Internet Archive's help page says: "If you would like to submit a request for archives of your site or account to be excluded from web.archive.org, send us a request to [email protected] and indicate:" followed by the details to include (see the link below).

https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/

Upvotes: 0

unor

Reputation: 96607

The Internet Archive (archive.org) crawler uses the User-Agent value ia_archiver (see their documentation).

So if you want to target this bot in your robots.txt, use

User-agent: ia_archiver

And this is exactly what Facebook does in its robots.txt:

User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
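
One way to sanity-check rules like these before deploying them is Python's standard urllib.robotparser module. The sketch below parses the snippet above and checks that ia_archiver is blocked while another crawler is unaffected. "Googlebot" and the example.com URLs are placeholders, the helper name can_fetch is made up, and a real crawler may interpret the rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The ia_archiver group quoted above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return whether ROBOTS_TXT lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://example.com"  # placeholder host
    for agent, path in [
        ("ia_archiver", "/some/page"),
        ("ia_archiver", "/about/privacy"),
        ("Googlebot", "/some/page"),
    ]:
        print(agent, path, can_fetch(agent, base + path))
```

Because the file has no `User-agent: *` group, crawlers other than ia_archiver are not restricted by it, which matches the requirement that search engine bots keep working.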

Upvotes: 3
