Reputation: 2742
I've been looking around the internet hoping that this is possible: I basically need to get just the title of a webpage and nothing else.
Web crawlers can take a long time performing tasks because they have to load pages before examining them, which is inefficient for what I am trying to achieve. Here's what I have so far:
PHP code:
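$url = 'http://www.ebay.com/itm/300702997750#ht_500wt_1156';
$str = file_get_contents($url); // downloads the whole page; returns false on failure
$title = '';
// "i" ignores tag case, "s" lets "." span newlines, ".*?" stops at the first </title>
if ($str !== false && preg_match('/<title[^>]*>(.*?)<\/title>/is', $str, $titleArr)) {
    $title = trim($titleArr[1]); // only read the match when there is one
}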
I want to know whether it would be possible to crawl only part of a page (for example, the first 2000 characters of the page).
Any help would be appreciated, Thanks.
Upvotes: 3
Views: 898
Reputation: 72975
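You could use substr to grab just the first 1000 chars, but that only trims the string after the whole page has been downloaded. Alternatively, you could use cURL with a range request: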
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/');
curl_setopt($ch, CURLOPT_RANGE, '0-500');      // ask for bytes 0 through 500 only
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);   // legacy option; a no-op in current PHP versions
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
$result = curl_exec($ch);
curl_close($ch);
echo $result;
That requests only bytes 0 through 500 (the first 501 bytes), provided the server honors the Range header. You can benchmark it by running something like this quick-and-dirty code:
$url = 'http://www.example.com/';
$range = array();
$repeats = 10;

// Mean of an array of timings
function average($a){
    return array_sum($a)/count($a) ;
}

// First pass: time requests that ask for bytes 0-500 only
for ($i=0;$i<$repeats;$i++) {
    $time_start = microtime(true);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RANGE, '0-500');
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    $time_end = microtime(true);
    $time = $time_end - $time_start;
    curl_close($ch);
    $range[] = $time;
}
echo "With range: average = ".round(average($range),2)." seconds (Min: ".round(min($range),2).", Max: ".round(max($range),2).")\n";

// Second pass: time requests for the whole page
$range = array();
for ($i=0;$i<$repeats;$i++) {
    $time_start = microtime(true);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    $time_end = microtime(true);
    $time = $time_end - $time_start;
    curl_close($ch);
    $range[] = $time;
}
echo "Without range: average = ".round(average($range),2)." seconds (Min: ".round(min($range),2).", Max: ".round(max($range),2).")\n";
If I run that on my site (http://www.focalstrategy.com/), I get:
With range: average = 0.38 seconds (Min: 0.35, Max: 0.41)
Without range: average = 0.56 seconds (Min: 0.53, Max: 0.7)
Against http://en.wikipedia.org/wiki/PHP, I get:
With range: average = 0.11 seconds (Min: 0.05, Max: 0.5)
Without range: average = 0.48 seconds (Min: 0.34, Max: 0.78)
Against Stack Overflow I get:
With range: average = 1.31 seconds (Min: 1.1, Max: 1.46)
Without range: average = 1.37 seconds (Min: 1.18, Max: 1.7)
and against eBay I get:
With range: average = 1.75 seconds (Min: 1.56, Max: 1.99)
Without range: average = 1.74 seconds (Min: 1.51, Max: 2.14)
Judging by these timings, SO and eBay don't appear to honor range requests: the ranged and unranged fetches take about the same time.
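Timings are only indirect evidence, though. A more direct check is the HTTP status code: a server that honors the Range header should answer 206 (Partial Content), while one that ignores it normally answers 200 with the full body. Here is a minimal sketch of that check; describeRangeSupport and fetchRangeStatus are hypothetical helper names I made up for illustration, and I have not run this against the sites above.

```php
<?php
// Hypothetical helper: interpret the status code returned for a ranged request.
function describeRangeSupport(int $status): string
{
    if ($status === 206) {
        return 'partial'; // server sent only the requested bytes
    }
    if ($status === 200) {
        return 'full';    // Range header ignored, whole body sent
    }
    return 'other';       // redirect, error, or a failed request
}

// Hypothetical helper: request bytes 0-500 of $url and return the HTTP status code.
// Returns 0 if the request failed before any response arrived.
function fetchRangeStatus(string $url): int
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RANGE, '0-500');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // keep the body out of the output
    curl_exec($ch);
    // The handle is released when $ch goes out of scope at the end of the function.
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

You would call it as describeRangeSupport(fetchRangeStatus('http://www.example.com/')) and only expect a speed-up when the answer is 'partial'.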
In summary, sites that support range requests will get a speed-up; those that don't won't, and you'll just get the whole page instead.
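For servers that ignore the range, one possible workaround is to stop the transfer yourself once enough bytes have arrived, using a cURL write callback, and then pull the title out of whatever was received. This is only a sketch under assumptions: fetchHead and extractTitle are hypothetical names of mine, the 2000-byte default comes from the question, and the title is not guaranteed to sit within the first 2000 bytes of every page.

```php
<?php
// Hypothetical helper: pull the <title> text out of a (possibly truncated) HTML string.
// Returns null when no complete <title>...</title> pair is present.
function extractTitle(string $html): ?string
{
    // Case-insensitive, "s" lets "." span newlines, ".*?" stops at the first </title>.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Hypothetical helper: fetch at most $maxBytes of $url, even if Range is ignored.
function fetchHead(string $url, int $maxBytes = 2000): string
{
    $buffer = '';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Still send the range, so servers that support it send less to begin with.
    curl_setopt($ch, CURLOPT_RANGE, '0-' . ($maxBytes - 1));
    curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $maxBytes) {
        $buffer .= $chunk;
        // Returning anything other than strlen($chunk) makes cURL abort the transfer.
        return strlen($buffer) >= $maxBytes ? -1 : strlen($chunk);
    });
    curl_exec($ch); // returns false when the callback aborts; that is expected here
    return substr($buffer, 0, $maxBytes);
}
```

Usage would be something like $title = extractTitle(fetchHead($url)); a null result means the title was missing or cut off, in which case you could retry with a larger $maxBytes.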
Upvotes: 4