Reputation: 45
I am trying to scrape some info from some websites using PHP CURL, the problem is it gives me wrong (different) content than opening it with normal browser
The example site is this: http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453
I am trying to get the meta tags, in the browser it returns as:
<meta name="title" content="Razmere v Preboldu se umirjajo" />
<meta name="description" content="Za prebivalci Prebolda je nemirna noč, ki ji je sledilo jutro s še dodatnimi padavinami..." />
<link rel="image_src" href="http://web.vecer.com/portali/podatki/2010/09/19/slike/online_Prebold0-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453" />
but my curl gets this:
<title>VECER.COM: </title>
<meta name="title" content="" />
<meta name="description" content="" />
<link rel="image_src" href="-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=1899123000000000">
here is my code:
function curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
What I'm doing wrong?
Upvotes: 0
Views: 1289
Reputation: 552
hi for meta and all other attribute scraping you can use http://simplehtmldom.sourceforge.net/
$target_url = "http://stackoverflow.com/questions";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
//storeLink($url,$target_url);
echo "<br />Link stored: $url";
}
Upvotes: 1