Reputation: 19837
I'm trying to get the Google results HTML for the search term
intitle:index.of ”last modified” ”parent directory” (mp3|wma|ogg) "test" -htm -html -php -asp
using file_get_contents, with something like this:
$file = file_get_html("http://www.google.com/search?q=intitle:index.of%20%20%94last%20modified%94%20%20%94parent%20directory%94%20%20%28mp3|wma|ogg%29%20%20%22test%22%20-htm%20-html%20-php%20-asp");
(basically this is the search term :
http://www.google.com/search?q=intitle:index.of ”last modified” ”parent directory” (mp3|wma|ogg) "test" -htm -html -php -asp
)
and it returns a 503.
Does anyone know how I can get this working?
Thanks
Upvotes: 1
Views: 4970
Reputation: 7826
The question is a bit outdated but I'll still give it a shot as the answers are not that great.
First of all, using file_get_contents() is not going to work with Google.
Google will reject your query (and it did :-)
As the selected answer correctly said, their TOS does not allow automated access, and they defend their service against it.
However, it's your decision whether to ignore the non-scraping TOS of a scraping mega-business, and it's also your decision whether to accept the TOS in a legally binding way.
This said, there are several possibilities to continue:
If you have a very low volume of requests, you can use your normal internet connection (no proxies, etc.), but you need to make your query a bit more intelligent.
Look into "curl" for PHP; it's likely already installed.
Set the User agent to something like this: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
This will make Google think you are a Chrome browser, not a PHP script.
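A minimal sketch of that idea, assuming the curl extension is available. The helper names build_search_url() and fetch_html() are made up for this example, and the https URL and timeout are my own assumptions. Nothing here guarantees Google will accept the request; it only shows how to encode the query properly and send a browser-like User-Agent.

```php
<?php
// Hypothetical helper: builds the search URL with proper percent-encoding.
function build_search_url(string $query): string
{
    // http_build_query() encodes the value for use in a query string
    // (spaces become "+", quotes become %22, and so on).
    return 'https://www.google.com/search?' . http_build_query(['q' => $query]);
}

// Hypothetical helper: fetches a URL with a browser-like User-Agent.
// Returns the response body, or null on a transport error or a non-200 status.
function fetch_html(string $url, string $userAgent): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 15,     // assumed value, adjust as needed
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ($body !== false && $status === 200) ? $body : null;
}

// Usage (not executed here, since it would hit the network):
// $ua   = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
//       . '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36';
// $html = fetch_html(build_search_url('intitle:index.of "test" -htm'), $ua);
```

A null return covers the 503 case from the question, so the caller can back off instead of parsing an error page.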
From here on you can use DOM or regex or similar means to continue parsing the HTML content.
The problem here is that Google regularly changes its HTML source and its detection logic; this happens every few months to a year.
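For the DOM route, a rough sketch could look like the following. It only collects every link's href from an HTML string using PHP's DOMDocument. The function name extract_links() and the sample markup are invented for illustration, and the real result page structure is not assumed here, precisely because it keeps changing.

```php
<?php
// Hypothetical helper: collects the href of every <a> element in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {          // skip anchors without an href
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample markup, only for illustration.
$sample = '<html><body>'
        . '<a href="/url?q=http://example.com/&amp;sa=U">One</a>'
        . '<a name="x">No href</a>'
        . '</body></html>';

print_r(extract_links($sample));
```

You would then filter the returned links down to the ones that look like results, and expect to revisit that filter whenever the markup changes.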
Take a look at the source code and information here: http://google-scraper.squabbel.com
You'll find open source PHP parsing routines and curl code with a user agent there, which should help you get started fast.
If you need to scrape large amounts of results, you'll need to do a bit more; just comment here if you need more help.
Upvotes: 0
Reputation: 314
The search API is deprecated. You have to parse the HTML with this regexp:
/url\?q=([^<>&"]*)&
Be careful not to "spam" Google, limit the number of your queries, use a lot of proxies, simulate human behaviour...
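As a small sketch of how that regexp might be applied in PHP: I added a closing "/" delimiter so preg_match_all() accepts the pattern, and the helper name extract_result_urls() and the sample markup are made up for the example. It assumes the result links still have the /url?q=...& shape, which may no longer be true.

```php
<?php
// Hypothetical helper: pulls result URLs out of "/url?q=...&" style links.
function extract_result_urls(string $html): array
{
    // Same pattern as above, with a closing delimiter added for PHP.
    preg_match_all('/url\?q=([^<>&"]*)&/', $html, $matches);
    // The captured values are still percent-encoded, so decode them.
    return array_map('urldecode', $matches[1]);
}

// Made-up sample markup, only for illustration.
$sample = '<a href="/url?q=http://example.com/music/&amp;sa=U">Index of /music</a>'
        . '<a href="/url?q=http://example.org/a%20b/&amp;sa=U">Index of /a b</a>';

print_r(extract_result_urls($sample));
```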
Upvotes: -1
Reputation: 23783
Scraping is against Google's TOS (read 5.3). You should use their API:
http://code.google.com/apis/ajaxsearch/documentation/
There are examples of how to use it in PHP. The API also returns a structured object (JSON), so you'll save CPU time (no HTML parsing) and bandwidth (JSON contains data only).
Upvotes: 5