Belgin Fish

Reputation: 19837

Get Google Results PHP

I'm trying to fetch the Google results HTML for this search term:

intitle:index.of  ”last modified”  ”parent directory”  (mp3|wma|ogg)  "test" -htm -html -php -asp

using file_get_contents, so something like this:

$file = file_get_html("http://www.google.com/search?q=intitle:index.of%20%20%94last%20modified%94%20%20%94parent%20directory%94%20%20%28mp3|wma|ogg%29%20%20%22test%22%20-htm%20-html%20-php%20-asp");

(basically this is the search term :

http://www.google.com/search?q=intitle:index.of  ”last modified”  ”parent directory” (mp3|wma|ogg)  "test" -htm -html -php -asp

)

and it's returning a 503.

Does anyone know how I can get this working?

Thanks

Upvotes: 1

Views: 4970

Answers (3)

John

Reputation: 7826

The question is a bit outdated, but I'll still give it a shot as the answers are not that great.

First of all, using file_get_contents() is not going to work with Google.
Google will reject your query (and it did :-)

As the selected answer correctly said, their TOS does not allow automated access, and they defend their service against it.
However, it's your decision whether to ignore the no-scraping TOS of a scraping mega-business, and also your decision whether to accept the TOS in a legally binding way.

This said, there are several possibilities to continue:

If you have a very low volume of requests, you can use your normal internet connection (no proxies, etc.), but you need to make your query a bit more intelligent. Look into the curl extension for PHP; it's likely already installed.
Set the user agent to something like this: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"

This makes the request look like it comes from a Chrome browser rather than a PHP script.
From here on you can use DOM, regex, or similar means to parse the HTML content.
The problem is that Google regularly changes its HTML source and its detection logic; this happens every few months to a year.
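As a rough illustration of the curl approach described above, here is a minimal sketch. The helper names build_search_url() and fetch_html() are made up for this example, and the timeout is an arbitrary choice; whether Google actually answers with a results page depends on its detection logic at the time.

```php
<?php
// Hypothetical helper: builds the search URL with RFC 3986 percent-encoding,
// so quotes, spaces and parentheses in the query are escaped consistently.
function build_search_url(string $query): string
{
    return 'http://www.google.com/search?'
        . http_build_query(['q' => $query], '', '&', PHP_QUERY_RFC3986);
}

// Hypothetical helper: fetches a page with curl while sending a browser-like
// User-Agent. Returns the body, or null on a transport error or non-200 status
// (for example the 503 from the question).
function fetch_html(string $url, string $userAgent): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 15,     // assumed value, adjust as needed
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($body !== false && $status === 200) ? $body : null;
}
```

A call would look like fetch_html(build_search_url('"test" -htm -html'), $ua), with $ua set to the Chrome user agent string quoted above. A null return means the request was rejected or failed, so the caller should back off rather than retry immediately.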

Take a look at the source code and information here: http://google-scraper.squabbel.com

You'll find open-source PHP parsing routines and curl code with a user agent set, which should help you get started fast.

If you need to scrape large amounts of results, you'll need a bit more than that. Just comment here if you need more help.

Upvotes: 0

Julien Ricard

Reputation: 314

The search API is deprecated. You have to parse the HTML with a regexp like this:

/url\?q=([^<>&"]*)&/

Be careful not to "spam" Google: limit the number of your queries, use a lot of proxies, simulate human behaviour...
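For illustration, the regexp above could be applied with preg_match_all() roughly as follows. The helper name extract_result_urls() is made up for this example, and it assumes result links appear in the HTML as /url?q=TARGET&amp;..., which may no longer match what Google serves.

```php
<?php
// Hypothetical helper: collects every target URL captured by the regexp
// and percent-decodes it. Returns an empty array when nothing matches.
function extract_result_urls(string $html): array
{
    preg_match_all('/url\?q=([^<>&"]*)&/', $html, $matches);

    // $matches[1] holds the first capture group of each match.
    return array_map('urldecode', $matches[1]);
}
```

Feeding it the HTML string returned by your HTTP request gives a plain list of result URLs; an empty list usually means the markup changed or the request was blocked.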

Upvotes: -1

Aillyn

Reputation: 23783

Scraping is against Google's TOS (see section 5.3). You should use their API:

http://code.google.com/apis/ajaxsearch/documentation/

There are examples of how to use it in PHP. The API also returns a structured object (JSON), so you'll save CPU time (no HTML parsing) and bandwidth (the JSON contains data only).

Upvotes: 5
