Reputation: 3616
I have a bunch of URLs which are currently indexed in Google. Given those URLs, is there a way to figure out when Google last crawled them?
Manually, if I look a URL up in Google and click its 'Cached' link, I can see the date it was crawled. Is there a way to do this automatically? A Google API of some sort?
Thank you :)
Upvotes: 1
Views: 2138
Reputation: 559
<?php
$domain_name = $_GET["url"];

// Scrape the "as it appeared on ..." timestamp from Google's cache page.
function googlebot_lastaccess($domain_name)
{
    $request = 'http://webcache.googleusercontent.com/search?hl=en&q=cache:' . $domain_name . '&btnG=Google+Search&meta=';
    $data = getPageData($request);
    $spl = explode("as it appeared on", $data);
    if (count($spl) < 2) {
        return 0; // marker not found: page is not cached (or Google changed the markup)
    }
    $spl2 = explode(".<br>", $spl[1]);
    $value = trim($spl2[0]);
    return (strlen($value) == 0) ? 0 : $value;
}

$content = googlebot_lastaccess($domain_name);
$date = substr($content, 0, strpos($content, 'GMT') + strlen('GMT'));
echo "Googlebot last access = " . $date . "<br />";

// Fetch a URL with cURL if available, falling back to file_get_contents().
function getPageData($url)
{
    if (function_exists('curl_init')) {
        $ch = curl_init($url); // initialize cURL with the given URL
        // reuse the visitor's user agent, with a generic fallback for CLI runs
        $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string
        if ((ini_get('open_basedir') == '') && (ini_get('safe_mode') == 'Off')) {
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
        }
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // max. seconds to wait for a connection
        curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an HTTP error
        return @curl_exec($ch);
    }
    else {
        return @file_get_contents($url);
    }
}
?>
Just upload this PHP script and create a cron job to run it. You can test it as follows: .../bot.php?url=http://www....
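Note that if the cron job runs the script through the PHP CLI rather than requesting it over HTTP, $_GET won't be populated. A minimal sketch of a tweak for that (the argv handling is my addition, not part of the original script):

<?php
// Replace the first assignment in the script above with something like this,
// so the URL can come from the query string (web) or from argv (CLI/cron).
$domain_name = isset($_GET["url"]) ? $_GET["url"]
             : (isset($argv[1]) ? $argv[1] : '');
if ($domain_name === '') {
    die("Usage: php bot.php <url>  or  bot.php?url=<url>\n");
}
?>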
Upvotes: 2
Reputation: 29
You can check Googlebot's last visit using http://www.gbotvisit.com/
Upvotes: 0
Reputation: 2724
Google doesn't provide an API for this type of data. The best way to track last-crawled dates is to mine your server logs.
In your server logs, you should be able to identify Googlebot by its typical user agent: Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html). From there you can see which URLs Googlebot has crawled, and when; see the sketch below.
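A minimal PHP sketch of that log mining, assuming a combined-format Apache/Nginx access log (the path is my assumption; point it at your own log file):

<?php
// List Googlebot requests from a combined-format access log.
$log = '/var/log/apache2/access.log';
foreach (file($log) as $line) {
    if (strpos($line, 'Googlebot') === false) {
        continue; // skip requests from other clients
    }
    // combined format: IP ident user [date] "METHOD path HTTP/x.x" status ...
    if (preg_match('/^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)/', $line, $m)) {
        echo $m[2] . "  " . $m[3] . "\n"; // crawl time and crawled URL
    }
}
?>

Each output line gives you a timestamp and the URL Googlebot requested; the last occurrence per URL is its last-crawled date.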
If you want to be sure it really is Googlebot crawling those pages, you can verify it with a reverse DNS lookup, as sketched below. (Bingbot supports reverse DNS verification as well.)
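A hedged PHP sketch of that double DNS check (the sample IP is purely illustrative; use addresses taken from your own logs):

<?php
// Verify a claimed Googlebot IP: reverse lookup, then forward-confirm.
function isRealGooglebot($ip)
{
    $host = gethostbyaddr($ip); // reverse DNS: IP -> hostname
    if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false; // hostname is not under Google's crawler domains
    }
    return gethostbyname($host) === $ip; // forward DNS must map back to the same IP
}

var_dump(isRealGooglebot('66.249.66.1')); // illustrative IP only
?>

If the hostname ends in googlebot.com or google.com and the forward lookup maps back to the same IP, the visitor is genuine.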
If you don't want to parse your server logs by hand, you can always use something like Splunk or Logstash; both are great log-processing platforms.
Also note that the "cached" date in the SERPs doesn't necessarily match the last-crawled date. Googlebot can crawl your pages multiple times after the "cached" date without updating the cached version, so you can think of the "cached" date as more of a "last indexed" date, though that's not exactly right either. In any case, if you ever need to get a page re-indexed, you can always use Google Webmaster Tools (GWT). There's an option in GWT to force Googlebot to re-crawl a page, and also to re-index it. There's a weekly limit of 50 or something like that.
Upvotes: 3