Reputation: 351
We are trying to get a better metric for the number of automated requests coming to our site. Our site serves a lot of data but also serves web pages. It's easy to distinguish pages served from data files served, but some data files are served in response to manual requests made through a web page, while automated requests typically fetch the data directly, using programs like curl or wget.
Our current practice is to periodically look at user agent strings and judge from the agent (wget, for instance) that a request was automated. The problem is that new agents appear all the time, so we are always behind the curve. Worse, some clients retrieving data in an automated way disguise themselves with User-Agent strings that claim to be a browser when they are not.
It occurred to me that if, based on an Apache log entry, we could determine that JavaScript is on, then a human probably sent the request. It's not perfect, but it would be better and more portable than what we have now. If we detect that JavaScript is off, we could count it as an automated request.
Is something like this possible? Is there any code or library out there that is smart enough to do this work for us and is regularly maintained?
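To make the idea concrete, here is roughly the kind of log correlation I'm picturing, in Python. The /js-beacon.gif path and the /data/ prefix are made up for illustration; the beacon would be some small image or script that only a page's JavaScript ever requests.

```python
import re

# Rough idea: parse the combined access log once, remember which IPs
# fetched data and which IPs also fetched the JavaScript-only beacon.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*"')

data_ips = set()    # IPs that requested a data file
beacon_ips = set()  # IPs that also fetched the JS-only beacon

with open("access.log") as log:
    for line in log:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if path == "/js-beacon.gif":
            beacon_ips.add(ip)
        elif path.startswith("/data/"):   # made-up data path
            data_ips.add(ip)

# IPs that pulled data but never executed the beacon script
likely_automated = data_ips - beacon_ips
print(len(likely_automated), "IPs look automated")
```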
Upvotes: 2
Views: 722
Reputation: 4247
You could maintain a whitelist rather than a blacklist. Users will let you know if they cannot get your content with browser xyz, and you can add it to the list.
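If it helps, here is a minimal Python sketch of that idea, keyed off the User-Agent string. The browser patterns are placeholders you would grow over time, not an existing library:

```python
import re

# Whitelist idea: only count a request as human if its User-Agent matches
# a known-browser pattern; everything else is treated as automated.
KNOWN_BROWSERS = [
    re.compile(p, re.I)
    for p in (r"firefox/", r"chrome/", r"safari/", r"msie |trident/")
]

def is_whitelisted(user_agent: str) -> bool:
    return any(p.search(user_agent) for p in KNOWN_BROWSERS)

print(is_whitelisted("Mozilla/5.0 ... Firefox/24.0"))  # True
print(is_whitelisted("Wget/1.14 (linux-gnu)"))         # False
```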
Upvotes: 1
Reputation: 3323
There is no direct way for Apache to detect whether a client has JavaScript enabled.
The most usable approach would be to just see which IPs are responsible for unusually high request counts, and ban them. This can, in fact, be automated, e.g. by counting requests per IP and sending 403 errors when an IP is too active; a rough sketch of the counting part is below.
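Here is a rough Python sketch of the counting part. The threshold and log path are made up, and the 403 itself would come from your server or firewall config, not this script:

```python
from collections import Counter

THRESHOLD = 1000  # made-up cutoff; tune it to your normal traffic

# Tally requests per client IP from an Apache access log and print the
# IPs above the threshold, which could feed a deny list or a 403 rule.
counts = Counter()
with open("access.log") as log:
    for line in log:
        ip = line.split(" ", 1)[0]  # IP is the first field of the combined format
        counts[ip] += 1

for ip, n in counts.most_common():
    if n < THRESHOLD:
        break
    print(f"{ip}\t{n} requests")
```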
Upvotes: 1
Reputation: 4995
You might want to have a look at http://nsg.cs.princeton.edu/publication/robot_usenix_06.pdf
Upvotes: 2