Reputation: 788
I'm asking for your opinion / experiences about this.
Our CMS reads information from the HTTP_USER_AGENT string. We recently discovered a bug in the code: we forgot to check whether HTTP_USER_AGENT is present at all (which is possible, but honestly we simply skipped that check and didn't expect it to happen), and those cases resulted in an error. We corrected it and added tracking: if HTTP_USER_AGENT is not set, an alert is sent to our tracking system.
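For reference, a minimal sketch of such a check in PHP (the sendTrackingAlert() helper is hypothetical, standing in for whatever alerting mechanism the tracking system uses):

```php
<?php
// Guard against a missing User-Agent header instead of assuming it exists.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : null;

if ($userAgent === null || $userAgent === '') {
    // Hypothetical helper: report the event to the tracking system.
    sendTrackingAlert('Request without HTTP_USER_AGENT', $_SERVER['REMOTE_ADDR']);
}
```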
We now have data/statistics from many websites covering the past months, and they show this is really rare: roughly 0.05-0.1% of requests.
Another interesting observation: these requests come in isolation. We didn't find a single case where such a "user" had multiple pageviews in the same session...
This got us thinking... Should we treat these requests as robots and simply block them? Or would that be a serious mistake?
Googlebot and other "good robots" always send HTTP_USER_AGENT info.
I know that firewalls or proxy servers MAY alter (or remove) the user-agent info, but our stats don't let us confirm whether that is what is happening here...
What are your experiences? Has anyone else done any research on this topic?
Other posts I found on Stack Overflow simply accept the fact that "it is possible this info is not sent". But why don't we question that for a moment? Is it really normal?
Upvotes: 5
Views: 4508
Reputation: 788
So, let's summarize a few things, based on the reactions.
Probably the best way is to combine all possibilities. :-)
If this is the first incoming request (checking the first request of the session is enough), we can immediately check it against multiple criteria. On the server side we can maintain a dynamic database of user-agent strings / IP addresses, built by mirroring public databases. (Yes, there are several public, regularly updated databases available on the internet for identifying bots; they contain not only user-agent strings but source IPs too.)
If the request does carry a user-agent, we can quickly check it against the database. If that filter says "OK", we can mark the client as a trusted bot and serve the request (a sketch of such a lookup follows).
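A minimal sketch of that lookup, assuming a mirrored known_bots table with user_agent and ip columns (the table name and the PDO connection are illustrative assumptions):

```php
<?php
// Check the request against a locally mirrored bot database.
// The table layout (known_bots: user_agent, ip) is an assumption for illustration.
function isTrustedBot(PDO $db, string $userAgent, string $ip): bool
{
    $stmt = $db->prepare(
        'SELECT COUNT(*) FROM known_bots WHERE user_agent = :ua OR ip = :ip'
    );
    $stmt->execute([':ua' => $userAgent, ':ip' => $ip]);

    return (int) $stmt->fetchColumn() > 0;
}
```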
The problem is when there is no user-agent info in the request at all (actually, this was the origin of my question). What to do then? :-)
We need to make a decision here.
The easiest way is to simply deny these requests and consider them abnormal. Of course we may lose some real users this way, but according to our stats the risk is small, I think. We can also send back a human-readable message such as "Sorry, but your browser doesn't send user-agent info, so your request is denied" - or whatever. If it is a bot, there will be no one to read that anyway; if it is a human, we can kindly give her/him usable instructions (see the sketch below).
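A sketch of that "deny with a friendly message" option; the status code and wording are just one possible choice:

```php
<?php
// Deny requests that carry no User-Agent header at all.
if (empty($_SERVER['HTTP_USER_AGENT'])) {
    http_response_code(403);
    header('Content-Type: text/plain; charset=utf-8');
    echo "Sorry, but your browser doesn't send user-agent info, so your request was denied.\n";
    echo "Please disable any proxy or privacy tool that strips the User-Agent header and try again.\n";
    exit;
}
```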
If we decide not to deny these requests, we can use the post-tracking mechanism suggested by MrCode here: OK, we serve THAT request, but we start collecting behaviour info. How? For example, note the IP address in a database (greylist it) and reference a fake CSS file in the response - one served not statically by the webserver but by our server-side language: PHP, Java or whatever we are using. If this is a robot, it is very unlikely to download the CSS file; a real browser definitely will, probably within a very short time frame (e.g. 1-2 seconds). We can then continue the process in the action that serves the fake CSS file: look the IP up in the greylist, and if we judge the behaviour normal, whitelist that IP address (for example). A sketch of such an endpoint follows.
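A sketch of the fake-CSS endpoint, assuming a URL like /fake.css is routed to this PHP script; the greylist/whitelist helpers are hypothetical wrappers around whatever storage is used:

```php
<?php
// /fake.css is rewritten to this script. A real browser fetching the page's
// stylesheet will hit it; a typical bot will not.
$ip = $_SERVER['REMOTE_ADDR'];

// Hypothetical helpers backed by the greylist/whitelist storage.
if (isGreylisted($ip)) {
    whitelistIp($ip);          // behaviour looks like a real browser
    removeFromGreylist($ip);
}

// Serve an empty (or minimal) stylesheet so the browser is satisfied.
header('Content-Type: text/css');
echo "/* ok */\n";
```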
If we get another request from a greylisted IP address (a sketch of this check follows the list):
a) within the 1-2 second time frame: we can delay our response by a few seconds (waiting for the parallel request - maybe it will download the fake CSS in the meantime) and check our greylist db periodically to see whether the IP address has disappeared from it
b) after the 1-2 second time frame: we simply deny the request
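A sketch of that decision for a follow-up request from a greylisted IP; the greylist helpers, the 2-second window and the "deny if still greylisted after waiting" choice are assumptions:

```php
<?php
$ip = $_SERVER['REMOTE_ADDR'];

// Hypothetical helpers: greylistedAt() returns the UNIX timestamp of greylisting.
if (isGreylisted($ip)) {
    if (time() - greylistedAt($ip) <= 2) {
        // a) still inside the window: wait briefly, hoping the fake CSS
        //    request arrives in parallel and clears the greylist entry.
        for ($i = 0; $i < 4 && isGreylisted($ip); $i++) {
            usleep(500000); // 0.5 s
        }
        if (isGreylisted($ip)) {
            http_response_code(403);
            exit('Denied.');
        }
    } else {
        // b) past the window and still no CSS fetch: treat it as a bot.
        http_response_code(403);
        exit('Denied.');
    }
}
// ...otherwise continue serving the page as normal.
```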
So, something like that... How does it sound?
But this is not perfect yet, since during this mechanism we served one real page to the potential bot... I think we can avoid that too. We could send back an empty, slightly delayed redirect page for this first request. This can be done easily in the HTML HEAD section, or with JavaScript, which is another great bot filter - but it could filter out real users with JavaScript switched off too (I have to say, a visitor with no user-agent string AND JavaScript switched off can really go to hell...). Of course we can add some text to the page like "you will be redirected soon" to calm down potential real users. While this page is waiting for the redirect to happen, a real browser will download the fake CSS, so the IP will be whitelisted by the time the redirect occurs - and voila. A sketch of such a holding page is below.
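A sketch of the holding page, using a meta refresh in the HEAD plus the fake CSS link; the 3-second delay and the /real-page URL are placeholders:

```php
<?php
// Holding page for the first request from a client with no User-Agent:
// the meta refresh gives a real browser time to fetch the fake CSS
// (which whitelists the IP) before the actual page is requested.
?>
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="3;url=/real-page">
    <link rel="stylesheet" href="/fake.css">
    <title>Please wait</title>
</head>
<body>
    <p>You will be redirected soon&hellip;</p>
</body>
</html>
```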
Upvotes: 0
Reputation: 64526
I would consider the lack of a user-agent abnormal for genuine users; however, it is still a [rare] possibility, which may be caused by a firewall, proxy or privacy software stripping the user-agent.
A request missing a user-agent is most likely a bot or script (not necessarily a search engine crawler), although you can't say for sure, of course.
Other factors that may indicate a bot/script:
Upvotes: 5