Reputation: 5
Basically, there are a couple hundred subpages I'm pulling off a site (as a test run), and then I have to parse each of those subpages for some data. All of this works fine, but doing it serially takes too long because there are so many pages, so I switched to curl_multi_exec. Now I'm running into a problem where some of the pages come back blank. Which pages are blank is random, so I'm assuming the web server is deciding not to respond because I'm hitting it with 200 requests at once. Is there a way to limit the number of simultaneous requests, have curl retry a request that didn't return properly, or otherwise deal with this problem?
Existing curl code:
function multiple_html_requests($nodes) {
    $mh = curl_multi_init();
    $curl_array = array();

    // Create one easy handle per URL and attach it to the multi handle.
    foreach ($nodes as $i => $url) {
        $curl_array[$i] = curl_init($url);
        curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $curl_array[$i]);
    }

    // Run all transfers until none are still active.
    $running = null;
    do {
        usleep(10000);
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    // Collect the response bodies, keyed by URL.
    $res = array();
    foreach ($nodes as $i => $url) {
        $res[$url] = curl_multi_getcontent($curl_array[$i]);
    }

    // Clean up.
    foreach ($nodes as $i => $url) {
        curl_multi_remove_handle($mh, $curl_array[$i]);
    }
    curl_multi_close($mh);

    return $res;
}
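One idea I had is to wrap the function above so the URLs are fed in smaller batches and anything that comes back blank gets re-queued. A rough sketch (the helper name, batch size, and retry cap are all arbitrary):

function fetch_in_batches($urls, $batch_size = 20, $max_attempts = 3) {
    $results = array();
    // Keep making passes over the remaining URLs until everything
    // succeeded or the retry cap is reached.
    for ($attempt = 1; $attempt <= $max_attempts && !empty($urls); $attempt++) {
        $failed = array();
        foreach (array_chunk($urls, $batch_size) as $batch) {
            foreach (multiple_html_requests($batch) as $url => $html) {
                if ($html === null || trim((string)$html) === '') {
                    $failed[] = $url;   // blank response: queue it for another attempt
                } else {
                    $results[$url] = $html;
                }
            }
        }
        $urls = $failed;
    }
    return $results;
}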
Upvotes: 0
Views: 1144
Reputation: 5540
You can use this class:
https://github.com/petewarden/ParallelCurl
It is a layer on top of curl_multi and supports capping the number of simultaneous requests.
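A minimal usage sketch, based on my recollection of that library's README (double-check the constructor arguments and callback signature against the repo):

require_once('parallelcurl.php');

// NOTE: method names taken from the ParallelCurl README; verify against the current source.
// Cap concurrency at 10; any additional URLs wait in the class's internal queue.
$parallel_curl = new ParallelCurl(10, array(CURLOPT_RETURNTRANSFER => true));

// Called once per request as it completes.
function on_request_done($content, $url, $ch, $user_data) {
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) != 200 || trim((string)$content) === '') {
        // Blank or failed response: this is where you could re-queue $url.
        return;
    }
    // ... parse $content here ...
}

foreach ($urls as $url) {
    $parallel_curl->startRequest($url, 'on_request_done', null);
}

// Block until every queued request has finished.
$parallel_curl->finishAllRequests();

With the limit set to 10, your couple hundred URLs never hit the server more than 10 at a time, which should help with the blank responses.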
Upvotes: 1