Reputation: 1436
I have data that I need to modify using the first result of a certain Google search. This search has to be repeated about 300,000 times (once per row), with varying search keywords.
I wrote a bash script for this using wget. However, after about 30 (sequential) requests, my queries seem to get blocked:
Connecting to www.google.com (www.google.com)|74.125.24.103|:80... connected. HTTP request sent, awaiting response... 404 Not Found
ERROR 404: Not Found.
I am using this snippet:
wget -qO- --limit-rate=20k --user-agent='Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0' "http://www.google.de/search?q=wikipedia%20$encodedString"
I depend on this working, so I hope someone has experience with it. It is not a recurring job and does not need to finish quickly; it would even be acceptable if the 300,000 requests took over a week.
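Roughly, the surrounding loop looks like this (a simplified sketch: the keyword file, the encoding helper and the delay are placeholders, not my exact script):

#!/bin/bash
# Sketch: one Google query per row, throttled, keeping the first result page.
while IFS= read -r keyword; do
    # URL-encode the keyword (placeholder; any encoder will do)
    encodedString=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "$keyword")
    wget -qO- --limit-rate=20k \
        --user-agent='Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0' \
        "http://www.google.de/search?q=wikipedia%20$encodedString" >> results.html
    sleep 5   # speed does not matter, so a pause between requests is fine
done < keywords.txt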
Upvotes: 0
Views: 1704
Reputation: 36442
Google won't let you do this; it has a rather advanced set of heuristics to detect "non-human" usage. If you want to do something automated with Google, it kind of forces you to use their API.
Short of distributing your queries over a very large set of clients (given that you have 3*10^5 queries and get blocked after roughly 3*10^1, you would need on the order of 10,000 of them), which is neither feasible nor a sensible amount of effort, you will need to use an API that is meant to be automated.
Luckily, Google offers a JSON API, which is far easier to parse from a script; have a look at https://stackoverflow.com/a/3727777/4433386 .
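As a minimal sketch (assuming the endpoint and response fields described in that answer; jq is only used here for illustration), a single lookup could look like:

curl -s "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=wikipedia%20foo" \
    | jq -r '.responseData.results[0].url'

That gives you structured JSON rather than an HTML page, so you can extract the first result URL directly instead of scraping it.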
Upvotes: 1