Reputation: 1968
I am trying to extract urls from a large number of google search results. Getting them from the source code is proving to be quite challenging as the delimiters are not clear and not all of the urls are in the code. Is there a tool that can extract urls from a certain area of an image? If so that may be a better solution.
Any help would be much appreciated.
Upvotes: 0
Views: 422
Reputation: 7448
Use this excellent lib: http://simplehtmldom.sourceforge.net/manual.htm
// Grab the source code
$html = file_get_html('http://www.google.com/');
// Find all anchors, returns a array of element objects
$ret = $html->find('a');
// Get a attribute ( If the attribute is non-value attribute (eg. checked, selected...), it will returns true or false)
$value = $ret->href;
EDit :
All "natural" search urls are in the #res div it seems.. With simplehtmldom find first #res, than all url inside of it. Don't remember exactly the syntax but it must be this way :
$ret = $html->find('div[id=res]')->find('a');
or maybe
$html->find('div[id=res] a');
Upvotes: 0
Reputation: 14106
Try using the JSON/Atom Custom Search API instead: http://code.google.com/apis/customsearch/v1/overview.html. It gives you 100 api calls per day, something you can increase to 10000 per day, if you pay.
Upvotes: 1