omg

Reputation: 139842

How to identify the web crawlers of Google/Yahoo/MSN in PHP?

AFAIK, $_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".

But is that the most reliable method?

Is there any other way?

Upvotes: 1

Views: 15170

Answers (9)

Rihad

Reputation: 103

Checking the User-Agent is hilariously unreliable: anyone can write anything they want there. A better way is a reverse DNS check; attackers would need to compromise a major search engine's DNS to get around it, which is unlikely. For those who don't know: you take an address, say 1.2.3.4, and do a PTR lookup on it. The result should be in the zone of a major search engine, like google.com or search.msn.com. That alone isn't enough, because whoever controls the reverse zone for an address can put any name in its PTR record. Then comes the important step: you do a forward DNS lookup on the name you got, like search.msn.com, and 1.2.3.4 should be in the list of A records you receive. If it is, 1.2.3.4 is a legitimate MSN search engine address.
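The two-step check described above could be sketched in PHP roughly as follows. This is a minimal sketch, not a definitive implementation: the function name is_verified_crawler, the suffix list you pass in, and the two optional resolver parameters (present only so the logic can be exercised without real DNS) are assumptions. gethostbyaddr() and gethostbynamel() are PHP built-ins; gethostbynamel() returns IPv4 addresses only, so IPv6 clients would need a different lookup such as dns_get_record().

```php
<?php
/**
 * Forward-confirmed reverse DNS check (illustrative sketch).
 *
 * $suffixes: zone names to accept, e.g. ['googlebot.com', 'search.msn.com'].
 * $reverse / $forward: hypothetical hooks for testing; they default to
 * PHP's gethostbyaddr() and gethostbynamel().
 */
function is_verified_crawler(
    string $ip,
    array $suffixes,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: PTR lookup. gethostbyaddr() returns the IP unchanged when
    // no name is found, and false on malformed input.
    $host = $reverse($ip);
    if (!is_string($host) || $host === '' || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // Step 2: the name must end with one of the accepted zones. The
    // leading dot stops "fakegooglebot.com" from matching "googlebot.com".
    $inZone = false;
    foreach ($suffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');
        if (substr($host, -strlen($suffix)) === $suffix) {
            $inZone = true;
            break;
        }
    }
    if (!$inZone) {
        return false;
    }

    // Step 3: forward lookup. The original address must be among the
    // A records returned for the name.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

A call might look like is_verified_crawler($_SERVER['REMOTE_ADDR'], ['googlebot.com', 'search.msn.com']). DNS lookups can be slow, so caching the result per IP address is probably worthwhile.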

Upvotes: 0

user3879851

Reputation: 56

A list of Google/Bing/Yahoo crawler IP addresses:

http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html

Upvotes: 1

cletus

Reputation: 625007

You identify search engines by user agent and IP address. More info can be found in How to identify search engine spiders and webbots. It's also worth noting this list. You shouldn't treat user agents (or even remote hosts) as necessarily definitive however. User agents are really nothing more than what the other end tells you it is and it is of course free to tell you anything. It's trivial to write code to pretend to be Googlebot.

In PHP, this means looking at $_SERVER['HTTP_USER_AGENT'] and $_SERVER['REMOTE_HOST'].
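One caveat when reading these two values: per the PHP manual, $_SERVER['REMOTE_HOST'] only exists when the web server is configured to do reverse lookups (in Apache, HostnameLookups On). The sketch below reads both values and falls back to gethostbyaddr() on REMOTE_ADDR when REMOTE_HOST is missing. The function name request_identity, the returned array keys, and the optional resolver parameter (a hook for testing) are assumptions made for illustration.

```php
<?php
/**
 * Collect the user agent and remote host name for a request (sketch).
 *
 * $server: normally $_SERVER.
 * $reverse: hypothetical test hook; defaults to PHP's gethostbyaddr().
 */
function request_identity(array $server, ?callable $reverse = null): array
{
    $reverse = $reverse ?? 'gethostbyaddr';

    // The User-Agent header is optional, so default to an empty string.
    $userAgent = $server['HTTP_USER_AGENT'] ?? '';

    // REMOTE_HOST is only set when the server performs reverse lookups.
    $host = $server['REMOTE_HOST'] ?? null;

    if ($host === null && isset($server['REMOTE_ADDR'])) {
        $resolved = $reverse($server['REMOTE_ADDR']);
        // gethostbyaddr() returns the IP unchanged when nothing resolves.
        if (is_string($resolved) && $resolved !== $server['REMOTE_ADDR']) {
            $host = $resolved;
        }
    }

    return ['user_agent' => $userAgent, 'host' => $host];
}
```

Calling request_identity($_SERVER) returns both values in one array. Neither value is proof of identity on its own, as noted above.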

There are a lot of search engines but honestly it's only the big few you really care about generally speaking. Google and Yahoo together have almost all of the market. But of course it depends on what you're trying to achieve.

Note: be very careful of treating search engines differently to normal users (like the "evil hyphen site" as Joel put it) when it comes to content. In particularly egregious cases, this could get your site removed from that search engine. Even if that doesn't happen you will probably put some users off who go to a site expecting something. If they're then presented with a "Please register to see this article" box instead, well, gratz on your high bounce rate.

Upvotes: 9

NinethSense

Reputation: 9028

$_SERVER['HTTP_USER_AGENT']

Check various user agent strings here: http://www.user-agents.org/

Upvotes: 1

Silfverstrom

Reputation: 29322

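I hacked something together. It looks at $_SERVER['HTTP_USER_AGENT'] to see whether the request claims to come from a search engine.

function is_crawlers() {
   // No trailing "|" here: an empty alternative would match every user agent.
   $sites = 'Google|Yahoo|msnbot'; // Add the rest of the search engines
   // The header may be absent, so default to an empty string.
   $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
   return preg_match("/$sites/", $agent) === 1;
}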

Upvotes: -1

Chad Birch

Reputation: 74518

First of all, I hope you're not trying to do this in order to serve search engine bots different content than your site contains for normal users. If they discover you doing this, your site will get removed from their listings entirely. So long as you understand the risks of it, you can usually find information about what unique user-agent they will use:

  • Verifying Googlebot (use user-agent, reverse DNS if you want to be sure)
  • Yahoo's user agent will contain "Slurp"

However, some people writing (usually poorly-behaved) web scrapers will set their User Agent strings to be the same as "legitimate" crawlers such as Google's. You can catch these by doing lookups on the bot's IP address/hostname to ensure that they actually are coming from Google/Yahoo/etc. Some more info about what to look for in hostname lookups (from this article):

  • Google crawlers will end with googlebot.com like in crawl-66-249-70-244.googlebot.com.
  • Yahoo crawlers will end with crawl.yahoo.net like in llf520064.crawl.yahoo.net.
  • Live Search crawlers will end with search.msn.com like in msnbot-65-55-104-161.search.msn.com.
  • Ask crawlers will end with ask.com like in crawler4037.ask.com.
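The hostname endings listed above could be turned into a small lookup like the one below. This is only a sketch: the function name crawler_from_hostname and the returned labels are assumptions, and the suffixes are just the four quoted in this answer, which may have changed since. It checks the hostname only, so the hostname itself still has to come from a trustworthy lookup on the bot's IP address.

```php
<?php
/**
 * Map a resolved hostname to a crawler label, or null if none matches.
 * The suffix table is illustrative, copied from the list in this answer.
 */
function crawler_from_hostname(string $host): ?string
{
    $suffixes = [
        '.googlebot.com'   => 'Google',
        '.crawl.yahoo.net' => 'Yahoo',
        '.search.msn.com'  => 'Live Search',
        '.ask.com'         => 'Ask',
    ];

    // Normalize case and drop any trailing dot from a fully qualified name.
    $host = strtolower(rtrim($host, '.'));

    foreach ($suffixes as $suffix => $label) {
        // Compare the end of the hostname; the leading dot in each suffix
        // prevents "evilgooglebot.com" from matching.
        if (substr($host, -strlen($suffix)) === $suffix) {
            return $label;
        }
    }

    return null;
}
```

For example, crawler_from_hostname('llf520064.crawl.yahoo.net') yields 'Yahoo', while a name such as 'googlebot.com.example.org' yields null because the suffix must be at the end.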

Upvotes: 8

The real napster

Reputation: 2324

I don't think the crawlers come from google.com, and I know of other visitors you don't want to treat as bots who do come from there: everyone who searches for your site.

What you need to do is look at the IP addresses of the different bots. http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=80553

Upvotes: 2

Pablo Fernandez

Reputation: 287390

The best way to do it with well-known, well-behaved robots, like those you mentioned, is by user agent, which you can find in $_SERVER['HTTP_USER_AGENT'].

Upvotes: 0

Chris Bartow

Reputation: 15111

You are probably better off using $_SERVER['HTTP_USER_AGENT'] and looking for Googlebot or Yahoo! Slurp.

Upvotes: 5
