Reputation: 6480
I have to fetch multiple web pages, let's say 100 to 500. Right now I am using curl to do so.
function get_html_page($url) {
//create curl resource
$ch = curl_init();
//set url
curl_setopt($ch, CURLOPT_URL, $url);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
//$output contains the output string
$html = curl_exec($ch);
//close curl resource to free up system resources
curl_close($ch);
return $html;
}
My major concern is the total time taken by my script to fetch all these web pages. I know that the time taken is directly proportional to my internet speed and hence the majority time is taken by $html = curl_exec($ch);
function call.
I was thinking that instead of creating and destroying curl instance again and again for each and every web page, if I create it only once and then just reuse it for each and every page and finally in the end destroy it. Something like:
<?php
function get_html_page($ch, $url) {
//$output contains the output string
$html = curl_exec($ch);
return $html;
}
//create curl resource
$ch = curl_init();
//set url
curl_setopt($ch, CURLOPT_URL, $url);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
.
.
.
<fetch web pages using get_html_page()>
.
.
.
//close curl resource to free up system resources
curl_close($ch);
?>
Will it make any significant difference in the total time taken? If there is any other better approach then please let me know about it also?
Upvotes: 1
Views: 133
Reputation: 21563
How about trying to benchmark it? It may be more efficient to do it the second way, but I don't think it will add up to much. I'm sure your system can create and destroy curl instances in microseconds. It has to initiate the same HTTP connections each time either way, too.
If you were running many of these at the same time and were worried about system resources, not time, it might be worth exploring. As you noted, most of the time spent doing this will be waiting for network transfers, so I don't think you'll notice a change in overall time with either method.
Upvotes: 1
Reputation: 151
For web scraping I would use : YQL + JSON + xPath. You'll implement it using cURL I think you'll save a lot of resources.
Upvotes: 0