user1422508

Reputation: 109

How to detect browser spoofing and robots from a user agent string in PHP

So far I am able to detect robots by matching user agent strings from a list against known robot user agents, but I was wondering what other methods there are to do this in PHP, as I am catching fewer bots than expected with this approach.

I am also looking to find out how to detect whether a browser or robot is spoofing another browser, based on its user agent string.

Any advice is appreciated.

EDIT: This has to be done using a log file with lines as follows:

129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"

This means I can't check user behaviour aside from access times.
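
For reference, the matching is roughly along these lines (the regex, log file name, and keyword list below are simplified placeholders rather than the exact code):

<?php
// Simplified sketch: scan an Apache combined log and flag lines whose
// user agent contains a known bot keyword. The file name, regex and
// keyword list are placeholders.
$botKeywords = array('bot', 'crawler', 'spider', 'slurp', 'wget', 'curl');

$handle = fopen('access.log', 'r');
while (($line = fgets($handle)) !== false) {
    // Combined log format: IP ... "request" status bytes "referer" "user agent"
    if (!preg_match('/^(\S+).*"([^"]*)"\s*$/', $line, $m)) {
        continue;
    }
    $ip        = $m[1];
    $userAgent = strtolower($m[2]);

    foreach ($botKeywords as $keyword) {
        if (strpos($userAgent, $keyword) !== false) {
            echo "Bot hit from $ip: $userAgent\n";
            break;
        }
    }
}
fclose($handle);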

Upvotes: 9

Views: 16522

Answers (5)

T9b

Reputation: 3502

Your question specifically relates to detection using the user agent string. As many have mentioned, this can be spoofed.

To understand what is possible in spoofing, and to see how difficult it is to detect, you are probably best advised to learn the art in PHP using cURL.

In essence, with cURL almost everything in a browser (client) request can be spoofed, with the notable exception of the IP address; and even there, a determined spoofer will hide behind a proxy server to stop you from detecting their real IP.
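
For example, a minimal cURL request with a forged user agent and referer might look like this (the URL and header values are just placeholders):

<?php
// Illustration only: the target URL and all header values are placeholders.
$ch = curl_init('http://example.com/');

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Pretend to be a desktop Firefox browser.
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0');
// Forge the referer and typical browser headers as well.
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/some-page.html');
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: en-US,en;q=0.5',
));

$response = curl_exec($ch);
curl_close($ch);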

It goes without saying that sending the same parameters with every request makes a spoofer detectable, but rotating through different parameters makes it very difficult, if not impossible, to pick spoofers out of genuine traffic logs.

Upvotes: 1

Igal Zeifman

Reputation: 1146

Because, as previously stated, user agents and IPs can be spoofed, they cannot be used on their own for reliable bot detection.

I work for a security company and our bot detection algorithm looks something like this:

  1. Step 1 - Gathering data:

    a. Cross-check the user agent against the IP (both need to match; a sketch of such a check follows this list).

    b. Check header parameters (what is missing, what order they arrive in, etc.).

    c. Check behavior (early access to and compliance with robots.txt, general behavior, number of pages visited, visit rates, etc.).

  2. Step 2 - Classification:

    By cross-verifying the data, the bot is classified as "Good", "Bad" or "Suspicious".

  3. Step 3 - Active challenges:

    Suspicious bots undergo the following challenges:

    a. JS challenge (can it execute JavaScript?)

    b. Cookie challenge (can it accept cookies?)

    c. If still not conclusive -> CAPTCHA
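
As an illustration of the user-agent/IP cross-check in step 1a, the reverse-then-forward DNS verification that Google and Bing document for their crawlers can be sketched like this (the verifyCrawler helper and its domain list are illustrative, not our actual pipeline):

<?php
// Cross-check a claimed crawler identity against its IP: reverse-resolve
// the IP, confirm the hostname belongs to the claimed operator, then
// forward-resolve that hostname and check it maps back to the same IP.
// Sketch only; the domain list covers Google/Bing conventions, not a full DB.
function verifyCrawler($ip, $userAgent)
{
    $claims = array(
        'googlebot' => array('googlebot.com', 'google.com'),
        'bingbot'   => array('search.msn.com'),
    );

    foreach ($claims as $token => $domains) {
        if (stripos($userAgent, $token) === false) {
            continue;
        }
        $host = gethostbyaddr($ip);                 // reverse DNS lookup
        if ($host === false || $host === $ip) {
            return false;                           // no usable PTR record
        }
        foreach ($domains as $domain) {
            $suffix = '.' . $domain;
            if (substr($host, -strlen($suffix)) === $suffix
                && gethostbyname($host) === $ip) {  // forward-confirm
                return true;                        // genuine crawler
            }
        }
        return false;   // claims to be a crawler but the DNS does not match
    }
    return true;        // no crawler claim in the user agent; nothing to verify
}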

This filtering mechanism is VERY effective, but I don't really think it could be replicated by a single person or even a non-specialized provider (for one thing, the challenges and the bot DB need to be constantly updated by a security team).

We offer some "do it yourself" tools in the form of Botopedia.org, our directory, which can be used for IP/user-name cross-verification, but for a truly efficient solution you will have to rely on specialized services.

There are several free bot monitoring solutions, including our own, and most use the same strategy I've described above (or something similar).

GL

Upvotes: 6

WebChemist

Reputation: 4411

No, user agents can be spoofed, so they are not to be trusted.

In addition to checking for JavaScript or image/CSS loads, you can also measure page load speed, as bots will usually crawl your site a lot faster than any human visitor would jump around. But this only works for small sites; on popular sites, many visitors behind a shared external IP address (a large corporation or a university campus) might hit your site at bot-like rates.
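
A rough sketch of that rate check against a parsed log might look like this (the $hits structure and the thresholds are assumptions, not hard rules):

<?php
// Rough rate check over a parsed log: flag IPs whose average gap between
// requests is suspiciously small. The $hits structure (ip => array of Unix
// timestamps) and the thresholds are assumptions.
function flagFastClients(array $hits, $minAvgGapSeconds = 2.0)
{
    $suspects = array();
    foreach ($hits as $ip => $timestamps) {
        if (count($timestamps) < 10) {
            continue;                              // too few requests to judge
        }
        $span   = max($timestamps) - min($timestamps);
        $avgGap = $span / (count($timestamps) - 1);
        if ($avgGap < $minAvgGapSeconds) {
            $suspects[] = $ip;                     // faster than a human browses
        }
    }
    return $suspects;
}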

I suppose you could also look at the order in which pages are loaded, as bots tend to crawl in a first-come, first-crawled order whereas human users usually don't fit that pattern, but that's a bit more complicated to track.

Upvotes: 2

laifukang

Reputation: 311

In addition to filtering keywords in the user agent string, I have had luck with putting a hidden honeypot link on all pages:

<a style="display:none" href="autocatch.php">A</a>

Then in "autocatch.php" record the session (or IP address) as a bot. The link is invisible to users, but its hidden nature would hopefully not be recognized by bots. Moving the style attribute out into a CSS file might help even more.
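
A minimal "autocatch.php" could look something like this (appending to a flat bots.txt file is just one illustrative way to record the hit):

<?php
// autocatch.php - anything that requests this hidden URL gets treated as a bot.
// Sketch only: the bots.txt flat file is a placeholder storage choice.
session_start();
$_SESSION['is_bot'] = true;                        // mark the session

$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : 'unknown';

// Append the IP and timestamp so later log analysis can exclude these clients.
file_put_contents(
    __DIR__ . '/bots.txt',
    $ip . ' ' . date('c') . PHP_EOL,
    FILE_APPEND | LOCK_EX
);

// Nothing for a human to see here.
http_response_code(204);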

Upvotes: 12

Kyros

Reputation: 512

Beyond just comparing user agents, you would keep a log of activity and look for robot behavior. Often this includes fetching /robots.txt and not loading images. Another trick is to ask the client whether it has JavaScript, since most bots won't report it as enabled.
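
For a log like the one in the question, a rough sketch of those two behavioral checks could be the following (the $requests structure, ip => list of requested paths, is an assumption):

<?php
// Sketch: given parsed log data as ip => array of requested paths, flag clients
// that fetched /robots.txt or never loaded a single image - both are typical of
// crawlers rather than browsers. The $requests structure is an assumption.
function classifyFromPaths(array $requests)
{
    $likelyBots = array();
    foreach ($requests as $ip => $paths) {
        $fetchedRobots = in_array('/robots.txt', $paths, true);
        $loadedImages  = (bool) preg_grep('/\.(png|jpe?g|gif|ico)(\?|$)/i', $paths);
        if ($fetchedRobots || !$loadedImages) {
            $likelyBots[$ip] = array(
                'fetched_robots_txt' => $fetchedRobots,
                'loaded_images'      => $loadedImages,
            );
        }
    }
    return $likelyBots;
}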

However, beware: you may well accidentally flag some visitors who are genuinely people.

Upvotes: 4
