Reputation: 158
OK, so I got it all working, but preg_match_all won't work against Yahoo's results.
If you take a look at:
http://se.search.yahoo.com/search?p=random&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t
then you can see that their HTML contains
<span class="url" id="something random"> the actual link </span>
But when I run preg_match_all on it, I don't get any results.
preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $urlContents[2], $yahoo);
Anyone got an idea?
I'm trying to run preg_match_all on the results I get from Google using cURL's curl_multi_getcontent.
I have succeeded in fetching the pages, but when I try to extract the result links, the pattern matches far too much.
I'm currently using:
preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
The match starts where it should, but it doesn't stop at the first closing tag; it just keeps going.
Check the HTML at www.google.com/search?q=random
for example, and you will see that every link starts with <cite> and ends with </cite>.
Could someone possibly help me with how I should retrieve this information? I only need the actual link address of each result.
public function multiSearch($question)
{
    $sites['google'] = "http://www.google.com/search?q={$question}&gl=sv";
    $sites['bing'] = "http://www.bing.com/search?q={$question}";
    $sites['yahoo'] = "http://se.search.yahoo.com/search?p={$question}";
    $urlHandler = array();
    foreach($sites as $site)
    {
        $handler = curl_init();
        curl_setopt($handler, CURLOPT_URL, $site);
        curl_setopt($handler, CURLOPT_HEADER, 0);
        curl_setopt($handler, CURLOPT_RETURNTRANSFER, 1);
        array_push($urlHandler, $handler);
    }
    $multiHandler = curl_multi_init();
    foreach($urlHandler as $key => $url)
    {
        curl_multi_add_handle($multiHandler, $url);
    }
    $running = null;
    do
    {
        curl_multi_exec($multiHandler, $running);
    }
    while($running > 0);
    $urlContents = array();
    foreach($urlHandler as $key => $url)
    {
        $urlContents[$key] = curl_multi_getcontent($url);
    }
    foreach($urlHandler as $key => $url)
    {
        curl_multi_remove_handle($multiHandler, $url);
    }
    foreach($urlContents as $urlContent)
    {
        preg_match_all('/<li class="g">(.*?)<\/li>/si', $urlContent, $matches);
        //$this->view_data['results'][] = "Random";
    }
    preg_match_all('#<div id="search"(.*)</ol></div>#i', $urlContents[0], $match);
    preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
    var_dump($links);
}
Upvotes: 4
Views: 396
Reputation: 45914
As in Darhazer's answer, you can turn on ungreedy mode for the whole regex using the U pattern modifier, or just make the quantifier itself ungreedy (or lazy) by following it with a ?:
preg_match_all('#<cite>(.+?)</cite>#si', ...
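To see the difference on a small input, here is a minimal sketch. The $html string is made-up sample markup (not real Google output), and the variable names are only illustrative.

```php
<?php
// Made-up sample markup with two <cite> elements (an assumption for illustration).
$html = '<li><cite>www.example.com/a</cite></li><li><cite>www.example.org/b</cite></li>';

// Greedy: .+ runs on to the LAST </cite>, so both results collapse into one match.
preg_match_all('#<cite>(.+)</cite>#si', $html, $greedy);

// Lazy: .+? stops at the first </cite> it can, giving one match per element.
preg_match_all('#<cite>(.+?)</cite>#si', $html, $lazy);

// If the captured text contains nested tags such as <b>, strip_tags() removes them.
$links = array_map('strip_tags', $lazy[1]);

var_dump(count($greedy[1]));
var_dump($links);
```

With the lazy quantifier, $lazy[1] holds one entry per <cite> element, which is probably what the question is after.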
Upvotes: 2
Reputation: 26719
Run the regular expression in ungreedy mode (the U modifier):
preg_match_all('#<cite>(.+)</cite>#siU', $urlContents[0], $links);
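One caveat worth knowing: the U modifier inverts greediness for the whole pattern, so a quantifier followed by ? becomes greedy again. A small sketch, using a made-up $html string as sample input:

```php
<?php
// Made-up sample markup (an assumption for illustration).
$html = '<cite>a.example</cite> text <cite>b.example</cite>';

// With U, a plain quantifier is lazy by default: one match per <cite> element.
preg_match_all('#<cite>(.+)</cite>#siU', $html, $withU);

// With U, the trailing ? flips the quantifier back to greedy,
// so this pattern swallows everything up to the last </cite>.
preg_match_all('#<cite>(.+?)</cite>#siU', $html, $inverted);

var_dump($withU[1]);
var_dump(count($inverted[1]));
```

So it is best to pick one approach, either the U modifier or the ? suffix, rather than combining them.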
Upvotes: 4