mire

Reputation: 45

Scraping with cURL

I am trying to scrape some information from websites using PHP cURL. The problem is that it returns different (wrong) content than I get when I open the same page in a normal browser.

The example site is this: http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453

I am trying to get the meta tags. In the browser, the page returns:

<meta name="title" content="Razmere v Preboldu se umirjajo" />
<meta name="description" content="Za prebivalci Prebolda je nemirna no&#269;, ki ji je sledilo jutro s &#353;e dodatnimi padavinami..." />
<link rel="image_src" href="http://web.vecer.com/portali/podatki/2010/09/19/slike/online_Prebold0-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=2010091905576453" />

but my cURL request gets this:

<title>VECER.COM: </title>
<meta name="title" content="" />
<meta name="description" content="" />
<link rel="image_src" href="-100.jpg" />
<link rel="target_url" href="http://web.vecer.com/portali/vecer/v1/default.asp?kaj=3&id=1899123000000000">

Here is my code:

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    // include the response headers in the returned data
    curl_setopt($ch, CURLOPT_HEADER, true);
    // read cookies from, and write them back to, the same file
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

What am I doing wrong?

Upvotes: 0

Views: 1289

Answers (1)

Vipin Singh

Reputation: 552

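For scraping meta tags and other attributes you can use Simple HTML DOM: http://simplehtmldom.sourceforge.net/

PHP's built-in DOMDocument and DOMXPath also work. The following example fetches a page with cURL and lists all of its links: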

$target_url = "http://stackoverflow.com/questions";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status codes >= 400 as errors
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set the Referer header when following redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
// curl_exec() returns false on failure; this check also stops on an empty response
if (!$html) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}
curl_close($ch);

// parse the HTML into a DOMDocument (@ suppresses warnings about malformed markup)
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    //storeLink($url,$target_url);
    echo "<br />Link stored: $url";
}
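
Since the question is about meta tags rather than links, here is a rough sketch of the same DOMDocument/DOMXPath approach applied to them. The helper name extractMetaTags is made up for illustration and is not part of any library, and the sample markup is a shortened copy of the tags from the question. Note that the query searches the whole document, because meta tags sit in the head and an expression starting at /html/body would miss them.

```php
<?php
// Hypothetical helper (not from the original answer):
// returns an array mapping each <meta name="..."> to its content attribute.
function extractMetaTags(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $meta = [];
    // Meta tags live in <head>, so search the whole document, not just <body>.
    foreach ($xpath->query('//meta[@name]') as $node) {
        $meta[$node->getAttribute('name')] = $node->getAttribute('content');
    }
    return $meta;
}

// Sample markup based on the tags shown in the question.
$html = '<html><head>'
      . '<meta name="title" content="Razmere v Preboldu se umirjajo" />'
      . '<meta name="description" content="" />'
      . '</head><body></body></html>';

$meta = extractMetaTags($html);
echo $meta['title'], "\n";
```

In a real script you would pass the $html string returned by curl_exec() instead of the hard-coded sample. This only reads what the server sent, so if the response itself has empty meta tags, as in the question, the values will come back empty too.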

Upvotes: 1
