Reputation: 19
I tried to access archived versions of facebook.com
web pages.
The archive showed me an error saying that it cannot save pages because of the site's robots.txt.
Can anyone tell me which statements in the robots.txt
make the site inaccessible to web.archive.org?
I guess it is the #permission statement mentioned here (http://facebook.com/robots.txt).
Is there a way I can do the same for my own site?
I also don't want woorank.com
or builtwith.com
to analyze my site.
Note: search engine bots should still be able to crawl and index my site after I add the statements to robots.txt
that achieve the results described above.
Upvotes: 1
Views: 4899
Reputation: 1406
Since 2017, the archive.org bot no longer respects robots.txt.
I inspected what traces the bot leaves by creating a test.php page that writes the $_SERVER variable to a text file:
file_put_contents("request.txt", json_encode($_SERVER) );
These were the relevant headers:
{
"HTTP_X_FORWARDED_FOR": "207.241.225.246",
"HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/605.1.15",
"HTTP_VIA": "Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot), Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot), 1.1 warcprox",
...
}
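You can block the bot by checking for the string "archive.org_bot" in the HTTP_VIA header (str_contains requires PHP 8; on older versions, strpos(...) !== false does the same check):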
if ( isset($_SERVER['HTTP_VIA']) && str_contains($_SERVER['HTTP_VIA'], "archive.org_bot") )
{
http_response_code(403);
die();
}
Upvotes: 3
Reputation: 7286
The Internet Archive's help page says: "If you would like to submit a request for archives of your site or account to be excluded from web.archive.org, send us a request to [email protected] and indicate:" followed by the details they need, which are listed at the link below.
https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/
Upvotes: 0
Reputation: 96607
The Internet Archive (archive.org) crawler uses the User-Agent value ia_archiver
(see their documentation).
So if you want to target this bot in your robots.txt, use
User-agent: ia_archiver
And this is exactly what Facebook does in its robots.txt:
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
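For your own site, a minimal sketch of the same idea might look like the following. It assumes you want to exclude the whole site for ia_archiver while leaving every other crawler, including search engine bots, unrestricted. It only has an effect on crawlers that choose to honor robots.txt. No tokens for woorank.com or builtwith.com are shown because their User-Agent values are not given here; you would need to look them up in each service's own documentation and add a similar group for each.

```
# Block only the Internet Archive crawler
User-agent: ia_archiver
Disallow: /

# Every other crawler may fetch everything (an empty Disallow blocks nothing)
User-agent: *
Disallow:
```

A crawler uses the group that matches its own User-Agent token most specifically, so ia_archiver follows the first group and all other bots fall through to the second.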
Upvotes: 3