Reputation: 4196
I have built a photo community web application in PHP/MySQL, using CodeIgniter as a framework. All content is public, so search engines regularly drop by. This is exactly what I want, yet it has two unwanted side effects:

1. Crawler visits fill up my session table with rows no real user will ever return to.
2. Every crawler request also increments my view counters, inflating the counts.
As for the second problem, I am rewriting the call to my view count script so it is only made from JavaScript; that should prevent count increases from search engines, right?
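To make that concrete, this is roughly the kind of endpoint I have in mind; the controller, method and column names below are just placeholders, not my real code:

```php
<?php
// application/controllers/Photo_views.php -- names are placeholders
class Photo_views extends CI_Controller {

    // Hit from JavaScript (an XMLHttpRequest fired after page load), so
    // crawlers that don't execute JS never reach it.
    public function increment($photo_id)
    {
        // Extra guard: only honour AJAX requests.
        if ( ! $this->input->is_ajax_request())
        {
            show_404();
            return;
        }

        // UPDATE photos SET view_count = view_count + 1 WHERE id = ?
        $this->db->where('id', (int) $photo_id);
        $this->db->set('view_count', 'view_count + 1', FALSE);
        $this->db->update('photos');

        $this->output->set_status_header(204); // no response body needed
    }
}
```

The page itself would then request something like `/photo_views/increment/123` from JavaScript once the DOM is ready.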
As for the session table, my thinking was to clean it up after the fact using a cron job, so as not to impact performance. I'm recording the IP and user agent string in the session table, so it seems to me that a blacklist approach is best. If so, what is the best way to approach it? Is there an easy/reusable way to determine that a session comes from a search engine?
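For illustration, this is the kind of cron-driven cleanup I'm picturing (controller name, table name and bot list are only placeholders); I understand CodeIgniter's User Agent library also offers `$this->agent->is_robot()`, which could be checked at request time instead so bot sessions are never written at all:

```php
<?php
// application/controllers/Session_cleanup.php -- names are placeholders.
// Run from cron via CodeIgniter's CLI: php index.php session_cleanup purge_bots
class Session_cleanup extends CI_Controller {

    // Substrings seen in the user-agent strings of the major crawlers;
    // extend this list as new bots show up in the logs.
    private $bot_patterns = array(
        'Googlebot', 'bingbot', 'Slurp', 'DuckDuckBot', 'Baiduspider', 'YandexBot',
    );

    public function purge_bots()
    {
        if ( ! $this->input->is_cli_request())
        {
            show_404();
            return;
        }

        // DELETE FROM ci_sessions WHERE user_agent LIKE '%Googlebot%' OR ...
        foreach ($this->bot_patterns as $i => $pattern)
        {
            ($i === 0)
                ? $this->db->like('user_agent', $pattern)
                : $this->db->or_like('user_agent', $pattern);
        }
        $this->db->delete('ci_sessions');

        echo $this->db->affected_rows()." bot sessions removed\n";
    }
}
```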
Upvotes: 2
Views: 185
Reputation: 239291
Why are you worried about either of these situations? The best strategy for dealing with crawlers is to treat them like any other user.
Sessions created by search engines are no different from any other session. They all have to be garbage collected, as you can't possibly assume that every user is going to click the "logout" button when they leave your site. Handle them the same way you handle any other expired session. You have to do this anyway, so why invest extra time in treating search engines differently?
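As far as I remember, CodeIgniter's Session library already garbage-collects expired rows based on your `sess_expiration` setting, so a cron may not be needed at all. If you do want one, a minimal standalone sketch (table name, credentials and expiry are placeholders, assuming the default `ci_sessions` schema where `last_activity` is a Unix timestamp) could look like:

```php
<?php
// cleanup_sessions.php -- run from cron; values below are placeholders
$expiry = 7200; // seconds; keep in sync with your sess_expiration setting

$db   = new mysqli('localhost', 'db_user', 'db_pass', 'photo_app');
$stmt = $db->prepare('DELETE FROM ci_sessions WHERE last_activity < ?');

$cutoff = time() - $expiry;
$stmt->bind_param('i', $cutoff);
$stmt->execute();

printf("%d expired sessions removed\n", $stmt->affected_rows);
```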
As for search engines incrementing view counters, why is that a problem? "View count" is a misleading term anyway; what you're really telling people is how many times the page has been requested. It's not up to you to ensure a pair of eyeballs actually sees the page, and there is really no reasonable way of doing so. For every bot you "blacklist", there will be a dozen more one-offs scraping your content and not serving up friendly user-agent strings.
Upvotes: 1
Reputation: 212412
Use a robots.txt file to control exactly what search engine crawlers are allowed to see and do.
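For example, a robots.txt at the web root could keep well-behaved crawlers away from a view-count endpoint while leaving the photo pages indexable (the paths here are only placeholders):

```
User-agent: *
Disallow: /photo_views/

Sitemap: https://example.com/sitemap.xml
```

Bear in mind that robots.txt is purely advisory; badly behaved bots will ignore it.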
Upvotes: 0