Reputation: 242
I can't get my scraper to return the specific content I'm looking for. If I return $output, I see digg as though it were hosted on my server, so I know I'm accessing the site properly; I'm just not able to access elements from the new DOM. What am I doing wrong?
<?php
include('simple_html_dom.php');
function curl_download($url) {
$ch = curl_init(); //creates a new cURL resource handle
curl_setopt($ch, CURLOPT_URL, "http://digg.com"); // Set URL to download
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the transfer as a string from curl_exec() instead of printing it (true = return, false = print)
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0"); // Set the User-Agent header
curl_setopt($ch, CURLOPT_HEADER, 0); // Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // Timeout in seconds
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
}
$html = new simple_html_dom();
$html->load($output, true, false );
foreach($html->find('div.digg-story__kicker') as $article) {
$article_title = $article->find('.digg-story__kicker')->innertext;
return $article_title;
}
echo $article_title;
?>
Edit: Okay, dumb mistake, I'm calling the function now:
$html = curl_download('http://digg.com');
and if I echo $html I see the "mirrored site". But when I use str_get_html($html), which simple_html_dom.php describes as "get html dom from string", I keep getting this error message:
Fatal error: Call to a member function str_get_html() on null in /home/andrew73124/public_html/scraper/scraper.php on line 31
Upvotes: 1
Views: 1397
Reputation: 33813
The cURL function needed an additional setting, namely CURLOPT_FOLLOWLOCATION, and the function itself needs to return a value so that its result can be used. In the code below I return an object containing both the response and the info, which lets you test the http_code before attempting to process the response data.
This uses the standard DOMDocument, but no doubt it will be easy to do the same with simple_html_dom.
function curl_download( $url ) {
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );/* NEW */
curl_setopt( $ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0" );
curl_setopt( $ch, CURLOPT_HEADER, 0 );
curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
return (object)array(
'response' => $output,
'info' => $info
);
}
$output = curl_download( 'http://www.digg.com' );
if( $output->info['http_code']==200 ){
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->substituteEntities=true;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTML( $output->response );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query('//div[@class="digg-story__kicker"]');
if( $col && $col->length > 0 ){ /* query() returns a DOMNodeList or false; empty() is never true for an object */
foreach( $col as $node )echo $node->nodeValue;
}
} else {
echo '<pre>',print_r($output->info,true),'</pre>';
}
Updated the answer to include the error-mitigation code offered by libxml. Weirdly, though, the code as originally written ran without issue locally before I added the libxml error handling.
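To illustrate what the libxml calls do, here is a minimal sketch that parses a small, deliberately sloppy HTML string rather than the live page. The sample markup and the `$sample`, `$warnings` and `$kickers` names are made up for illustration. The XPath expression is a common idiom for matching one class among several, which an exact `@class="..."` comparison would miss; whether digg's markup needs it is an assumption.

```php
<?php
// Hypothetical sample markup (not taken from digg.com): an unclosed <p>
// and a div that carries more than one class.
$sample = '<div class="digg-story__kicker">First kicker</div>'
        . '<p>unclosed paragraph'
        . '<div class="digg-story__kicker featured">Second kicker</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($sample);

// Keep the warnings for inspection if needed, then clear the buffer.
$warnings = libxml_get_errors();
libxml_clear_errors();

$xp = new DOMXPath($dom);

// Match any div whose class list contains the token "digg-story__kicker".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " digg-story__kicker ")]';

$kickers = [];
foreach ($xp->query($query) as $node) {
    $kickers[] = trim($node->nodeValue);
}

print_r($kickers);
```

If the warnings matter to you, loop over `$warnings` and log each `message` before discarding them.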
Without CURLOPT_FOLLOWLOCATION set I get:
Array
(
[url] => http://www.digg.com
[content_type] => text/html
[http_code] => 301
[header_size] => 191
[request_size] => 79
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 0.421
[namelookup_time] => 0.031
[connect_time] => 0.234
[pretransfer_time] => 0.234
[size_upload] => 0
[size_download] => 185
[speed_download] => 439
[speed_upload] => 0
[download_content_length] => 185
[upload_content_length] => 0
[starttransfer_time] => 0.421
[redirect_time] => 0
[certinfo] => Array
(
)
)
But with CURLOPT_FOLLOWLOCATION set to true I get:
WE'VE SEEN BETTER ANIME TRIBUTE VIDEOS...<more>...RESIST THE URGE TO SUBTWEET A BAD APPLE
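The two results above suggest a quick check before parsing: look at `http_code` and `redirect_count` in the array returned by `curl_getinfo()`. The sketch below is one possible way to do that. `describe_transfer` is a hypothetical helper name, not part of cURL, and the input array is trimmed from the dump shown above.

```php
<?php
// Hypothetical helper: summarise a curl_getinfo() array so that a redirect
// which was not followed is easy to spot.
function describe_transfer(array $info): string {
    $code = (int)($info['http_code'] ?? 0);
    $hops = (int)($info['redirect_count'] ?? 0);

    if ($code >= 300 && $code < 400) {
        return "HTTP $code: redirect not followed; enable CURLOPT_FOLLOWLOCATION";
    }
    if ($code === 200) {
        return "HTTP 200 after $hops redirect(s)";
    }
    return "HTTP $code: unexpected status";
}

// Values copied from the dump above (request made without CURLOPT_FOLLOWLOCATION).
$info = ['url' => 'http://www.digg.com', 'http_code' => 301, 'redirect_count' => 0];
echo describe_transfer($info), PHP_EOL;
// prints "HTTP 301: redirect not followed; enable CURLOPT_FOLLOWLOCATION"
```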
Upvotes: 1
Reputation: 20469
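Your loop is off: you are already looping over the matching elements, so calling find() again inside the loop is unnecessary (and without an index it returns an array, which has no innertext). Just access the innertext property of each element: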
foreach($html->find('div.digg-story__kicker') as $article) {
echo $article->innertext;
}
Upvotes: 1