Andrew

Reputation: 7768

Scrape all of Google's search results matching certain criteria?

I am working on my mapper and I need to get the full map of newegg.com.

I could try to scrape NE directly (which somewhat violates NE's policies), but many of their products are not reachable through NE's own search, only via a google.com search, and I need those links too.

Here is the search string that returns about 16 million results: https://www.google.com/search?as_q=&as_epq=.com%2FProduct%2FProduct.aspx%3FItem%3D&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=newegg.com&as_occt=url&safe=off&tbs=&as_filetype=&as_rights=

I want my scraper to go over all the results and log the hyperlink of each one. I can scrape the links from Google's search result pages, but Google limits each query to 100 pages (1,000 results), and again, Google is not happy with this approach. :)
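To illustrate what I have in mind, here is a rough sketch of the paging loop. I am only assuming, from the URLs, that Google pages through results with a start parameter in steps of 10 and stops around 1,000 results; the link extraction is deliberately naive:

```php
<?php
// Rough sketch: walk the result pages of the advanced-search query and print
// every newegg.com product link found. Assumes the &start= paging parameter
// and the ~1,000-result cap; a real scraper would use cURL and a proper parser.
$base = 'https://www.google.com/search?as_epq=' . urlencode('.com/Product/Product.aspx?Item=')
      . '&as_sitesearch=newegg.com&as_occt=url&safe=off';

for ($start = 0; $start < 1000; $start += 10) {
    $url  = $base . '&start=' . $start;
    $html = file_get_contents($url);              // placeholder; real code would send browser-like headers
    if ($html === false) {
        break;                                    // blocked, or no more pages
    }
    // very naive link extraction; DOMDocument would be more robust
    preg_match_all('#href="(https?://www\.newegg\.com/Product/Product\.aspx\?Item=[^"&]+)#i', $html, $m);
    foreach ($m[1] as $link) {
        echo $link, PHP_EOL;
    }
    sleep(5);                                     // be gentle between pages
}
```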

I am new to this. Could you advise me or point me in the right direction? Are there any tools or methodologies that could help me achieve my goal?

Upvotes: 0

Views: 4576

Answers (3)

John

Reputation: 7826

It might be a bit late, but I think it is worth mentioning that you can scrape Google professionally and reliably without causing problems.

Actually, as far as I know, scraping Google poses no real threat.
It is challenging if you are inexperienced, but I am not aware of a single case of legal consequences, and I follow this topic closely.

Maybe one of the largest cases of scraping happened some years ago, when Microsoft scraped Google to power Bing. Google was able to prove it by planting fake results that do not exist in the real world, and Bing suddenly picked them up.
Google named and shamed them; that is all that happened, as far as I remember.

Using the API is rarely a real option: it costs a lot of money even for a small number of results, and the free quota is rather small (40 lookups per hour before you get banned).
The other downside is that the API does not mirror the real search results; in your case that may be less of a problem, but in most cases people want the real ranking positions.

Now, if you do not accept Google's TOS, or ignore it (they did not care about your TOS when they scraped you in their startup days), you can go another route:
mimic a real user and get the data directly from the SERPs.

The key here is to send around 10 requests per hour (this can be increased to 20) from each IP address (yes, you use more than one IP). That rate has proven to cause no problems with Google over the past years.
Use caching, databases and IP rotation management to avoid hitting it more often than required.
The IP addresses need to be clean, unshared and, if possible, without an abusive history.
The originally suggested proxy list would complicate the matter a lot, as you receive unstable, unreliable IPs with questionable abusive use, sharing and history.
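As a rough sketch of what I mean (the proxy addresses, cache directory and user agent string are placeholders, and the exact timing is only what has worked for me):

```php
<?php
// Sketch of the throttling idea: each IP sends at most ~10 requests per hour and
// every result is cached, so no query is ever sent twice. Proxies are placeholders.
$proxies  = ['10.0.0.1:3128', '10.0.0.2:3128', '10.0.0.3:3128'];
$interval = 3600 / 10;                             // ~10 requests per hour per IP
$lastUsed = array_fill(0, count($proxies), 0);
$cacheDir = __DIR__ . '/cache';                    // assumed to exist

function fetch($url, $proxy) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
        CURLOPT_FOLLOWLOCATION => false,           // a 302 usually means the captcha page
        CURLOPT_TIMEOUT        => 30,
    ]);
    $html = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($html !== false && $code === 200) ? $html : false;
}

function scrape($url) {
    global $proxies, $interval, $lastUsed, $cacheDir;
    $cacheFile = $cacheDir . '/' . md5($url);
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);      // never ask Google the same thing twice
    }
    $i    = array_search(min($lastUsed), $lastUsed);   // pick the IP that has rested the longest
    $wait = $lastUsed[$i] + $interval - time();
    if ($wait > 0) {
        sleep($wait);                              // respect the ~10 requests/hour budget for this IP
    }
    $lastUsed[$i] = time();
    $html = fetch($url, $proxies[$i]);
    if ($html !== false) {
        file_put_contents($cacheFile, $html);
    }
    return $html;
}
```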

There is an open-source PHP project at http://scraping.compunect.com which contains all the features you need to get started. I used it for my work, which has now been running for some years without trouble. It is a finished project, mainly built to be used as a customizable base for your own project, but it runs standalone too.

Also, PHP is not a bad choice: I was originally sceptical, but I ran PHP (5) as a background process for two years without a single interruption.
The performance is easily good enough for such a project, so I would give it a shot.
Otherwise, PHP code reads much like C/Java: you can see how things are done and repeat them in your own project.

Upvotes: 0

Jodrell

Reputation: 35716

I've not tried it, but you can use Google's Custom Search API. Of course, it starts to cost money after 100 searches a day. I guess they must be running a business ;p
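I haven't run this, but a request would look something like the following; YOUR_API_KEY and YOUR_SEARCH_ENGINE_ID are placeholders you get from the API console, and the query string is just an example:

```php
<?php
// Untested sketch of a Custom Search API request; key and cx are placeholders
// from the Google API console.
$url = 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
    'key'   => 'YOUR_API_KEY',
    'cx'    => 'YOUR_SEARCH_ENGINE_ID',
    'q'     => 'site:newegg.com inurl:Product.aspx',
    'num'   => 10,    // results per request
    'start' => 1,     // 1-based offset for paging
]);

$response = json_decode(file_get_contents($url), true);
$items    = isset($response['items']) ? $response['items'] : [];
foreach ($items as $item) {
    echo $item['link'], PHP_EOL;   // each item carries the result URL in 'link'
}
```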

Upvotes: 1

Kiril

Reputation: 40345

I am new to this. Could you advise me or point me in the right direction? Are there any tools or methodologies that could help me achieve my goal?

Google takes a lot of steps to prevent you from crawling their pages, and I'm not talking about merely asking you to abide by their robots.txt. I don't agree with their ethics, nor with their T&C, not even the "simplified" version that they pushed out (but that's a separate issue).

If you want to be seen, then you have to let Google crawl your page; however, if you want to crawl Google, then you have to jump through some major hoops! Namely, you have to get a bunch of proxies so you can get past the rate limiting and the 302s + captcha pages that they put up any time they get suspicious about your "activity."

Despite being thoroughly aggravated about Google's T&C, I would NOT recommend that you violate it! However, if you absolutely need to get the data, then you can get a big list of proxies, load them into a queue and pull a proxy from the queue each time you want to get a page. If the proxy works, put it back in the queue; otherwise, discard it. Maybe even keep a counter for each failed proxy and discard it once it exceeds some number of failures.
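For example, the queue handling could look roughly like this; the proxy addresses and the failure threshold are only placeholders:

```php
<?php
// Rough sketch of the proxy-queue idea: take a proxy, use it, put it back if it
// worked, and drop it after too many failures. Proxy addresses are placeholders.
$queue    = new SplQueue();
$failures = [];
foreach (['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'] as $p) {
    $queue->enqueue($p);
    $failures[$p] = 0;
}

function fetchThroughProxy($url, $proxy) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_FOLLOWLOCATION => false,   // a 302 is usually the captcha redirect
        CURLOPT_TIMEOUT        => 20,
    ]);
    $html = curl_exec($ch);
    $ok   = $html !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
    curl_close($ch);
    return $ok ? $html : false;
}

function getPage($url) {
    global $queue, $failures;
    $maxFailures = 3;                      // arbitrary threshold
    while (!$queue->isEmpty()) {
        $proxy = $queue->dequeue();
        $html  = fetchThroughProxy($url, $proxy);
        if ($html !== false) {
            $queue->enqueue($proxy);       // proxy works, put it back in rotation
            return $html;
        }
        if (++$failures[$proxy] < $maxFailures) {
            $queue->enqueue($proxy);       // give it another chance later
        }                                  // otherwise it is discarded for good
    }
    return false;                          // ran out of usable proxies
}
```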

Upvotes: 3
