clarkk

Reputation: 1

Load and parse HTML string

When I try to parse search results from Google with DOMDocument::loadHTML(), I get the warnings shown below.

code

$html = file_get_contents('http://www.google.dk/search?q='.urlencode($query).'&start=0&num=100', false, $context);
                
$doc = new DOMDocument();
$doc->loadHTML($html);

error

PHP Warning:  DOMDocument::loadHTML(): Input is not proper UTF-8, indicate encoding ! in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132

Warning: DOMDocument::loadHTML(): Input is not proper UTF-8, indicate encoding ! in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132
PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132
PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132

Upvotes: 0

Views: 245

Answers (1)

Professor Abronsius

Reputation: 33823

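libxml has built-in error handling that helps here: calling libxml_use_internal_errors( true ) before loadHTML() collects the parse errors internally instead of emitting them as PHP warnings.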

            $query='php rocks';

            // Fetch the results page (file_get_contents returns false on failure)
            $data=file_get_contents('http://www.google.co.uk/search?q='.urlencode( $query ).'&start=0&num=100');

            // Collect libxml parse errors internally instead of emitting PHP warnings
            libxml_use_internal_errors( true );
            $html = new DOMDocument('1.0','utf-8');
            $html->validateOnParse=false;
            $html->standalone=true;
            $html->preserveWhiteSpace=true;
            $html->strictErrorChecking=false;
            $html->substituteEntities=false;
            $html->recover=true;
            $html->formatOutput=true;
            $html->loadHTML( $data );

            // Keep the last parse error for logging, then empty the error buffer
            $parse_errs=serialize( libxml_get_last_error() );
            libxml_clear_errors();

            // Query the result links relative to the results container (id "ires")
            $xpath=new DOMXPath( $html );
            $div=$html->getElementById('ires');
            $col=$xpath->query("ol/li/h3/a", $div );

            foreach( $col as $node ) echo $node->getAttribute('href').'<br />';

            // Release the document and XPath objects
            $html=null;
            $xpath=null;
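
The first warning in the question ("Input is not proper UTF-8, indicate encoding") suggests the page is not served as UTF-8, which suppressing errors alone does not address. The sketch below is one possible way to handle that, not a definitive fix. It assumes the source is ISO-8859-1 (in practice the charset would come from the response headers), uses a hard-coded sample string in place of the fetched page, and relies on the commonly used workaround of prefixing an XML encoding declaration so libxml treats the input as UTF-8. The sample markup and the `//div[@id="ires"]//a` query are illustrative only.

```php
<?php
// Hypothetical stand-in for the fetched page: Latin-1 bytes plus a bare '&' in a URL.
$raw = "<html><body><div id=\"ires\"><a href=\"/url?q=x&sa=U\">caf\xE9</a></div></body></html>";

// Assumption: the source charset is ISO-8859-1. Convert it to UTF-8 before parsing.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Prefixing an XML encoding declaration is a common workaround that tells libxml the input is UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $utf8);

// libxml_get_errors() returns every collected error, not only the last one.
$errors = libxml_get_errors();
libxml_clear_errors();

// Gather the href and text of each link inside the results container.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@id="ires"]//a') as $a) {
    $links[] = [$a->getAttribute('href'), $a->textContent];
}

// Report whatever the parser complained about, for logging.
foreach ($errors as $e) {
    echo trim($e->message), ' (line ', $e->line, ")\n";
}
```

Any remaining "htmlParseEntityRef: expecting ';'" messages come from unescaped ampersands in the page's URLs. They end up in `$errors` rather than in the PHP log, and the parser recovers from them.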

Upvotes: 1
