Reputation: 1436
I have data that I need to modify using the first result of a certain Google search. This search has to be repeated about 300,000 times (once per row), with varying search keywords.
I wrote a bash script for this using wget. However, after about 30 (sequential) requests, my queries seem to get blocked:
Connecting to www.google.com (www.google.com)|74.125.24.103|:80... connected. HTTP request sent, awaiting response... 404 Not Found
ERROR 404: Not Found.
I am using this snippet:
wget -qO- --limit-rate=20k --user-agent='Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0' "http://www.google.de/search?q=wikipedia%20$encodedString"
I depend on this working, so I hope someone has experience with it. It is not a recurring job and does not need to finish quickly; it would even be acceptable if the 300,000 requests took over a week.
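Roughly, the surrounding loop looks like this (a simplified sketch: the keyword file, the encoding helper and the delay are placeholders, not my exact script):

#!/bin/bash
# Sketch: one Google query per row, throttled, keeping the first result page.
while IFS= read -r keyword; do
    # URL-encode the keyword (placeholder; any encoder will do)
    encodedString=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "$keyword")
    wget -qO- --limit-rate=20k \
        --user-agent='Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0' \
        "http://www.google.de/search?q=wikipedia%20$encodedString" >> results.html
    sleep 5   # speed does not matter, so a pause between requests is fine
done < keywords.txt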
Upvotes: 0
Views: 1704
Reputation: 36442
Google won't let you do this; it has a rather advanced set of heuristics to detect "non-human" usage. If you want to do something automated with Google, it kind of forces you to use their API.
Short of distributing your queries over a very large set of clients (given that you have 3*10^5 queries and get blocked after roughly 3*10^1, you would need on the order of 10,000 of them), which is neither feasible nor a sensible amount of effort, you will need to use an API that is meant to be automated.
Luckily, Google offers a JSON API, which is far easier to parse from a script; have a look at https://stackoverflow.com/a/3727777/4433386 .
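As a minimal sketch (assuming the endpoint and response fields described in that answer; jq is only used here for illustration), a single lookup could look like:

curl -s "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=wikipedia%20foo" \
    | jq -r '.responseData.results[0].url'

That gives you structured JSON rather than an HTML page, so you can extract the first result URL directly instead of scraping it.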
Upvotes: 1