LuZ

Reputation: 345

DOMDocument loadHTML doesn't work properly on a server

I first ran the code on MAMP and it worked well. But when I ran the same code on another server, I got a lot of warnings like:

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 3349 in /cgihome/zhang1/html/cgi-bin/getPrice.php on line 17
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced tag in Entity, line: 3350 in /cgihome/zhang1/html/cgi-bin/getPrice.php on line 17
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 3517 in /cgihome/zhang1/html/cgi-bin/getPrice.php on line 17

The code is as follows:

<?php
 $amazon = file_get_contents('http://www.amazon.com/blablabla');
 $doc = new DOMdocument();
 $doc->loadHTML($amazon);
 $doc->saveHTML();
 $price = $doc -> getElementById('actualPriceValue')->textContent;
 $ASIN = $doc -> getElementById('ASIN')->getAttribute('value');
?>

Does anyone know what's going on? Thanks!

Upvotes: 34

Views: 45771

Answers (3)

hakre

Reputation: 197933

To disable the warning, you can use

libxml_use_internal_errors(true);

This works for me (see the PHP manual entry for that function); read on for the background.

Background: You are loading invalid HTML. Invalid HTML is quite common; DOMDocument::loadHTML corrects most of the problems but emits warnings by default.

With libxml_use_internal_errors you can control that behavior. Set it before loading the document:

$previously = libxml_use_internal_errors(true);
$doc->loadHTML($amazon);

Then after loading you can deal with the errors (if you want/need to):

/** @var LibXMLError[] $xmlErrors */
$xmlErrors = libxml_get_errors();

And finally clear them (as they will add up) and restore the previous setting if applicable:

unset($xmlErrors);
libxml_clear_errors();
libxml_use_internal_errors($previously);
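
Putting the three steps together, here is a minimal sketch rather than a definitive implementation. The helper name loadHtmlQuietly and the sample markup are made up for illustration, and the code assumes PHP 7.1 or later:

```php
<?php
// Hypothetical helper: parse possibly invalid HTML and return the document
// together with the libxml errors that were collected instead of emitted.
function loadHtmlQuietly(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    try {
        $loaded = $doc->loadHTML($html);
        $errors = libxml_get_errors();
    } finally {
        // Always empty the error buffer and restore the caller's setting.
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }

    return [$loaded ? $doc : null, $errors];
}

// Made-up markup with a stray end tag, standing in for the fetched page.
$sample = '<html><body><span id="actualPriceValue">$9.99</span></head></body></html>';

[$doc, $errors] = loadHtmlQuietly($sample);

foreach ($errors as $error) {
    // Each entry is a LibXMLError exposing level, code, line and message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// getElementById() returns null when no element matches, so check first.
$node  = $doc !== null ? $doc->getElementById('actualPriceValue') : null;
$price = $node !== null ? $node->textContent : null;
```

The try/finally is a design choice, not a requirement: it restores the previous libxml setting even if parsing throws, so other code in the same request is unaffected.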


Upvotes: 137

Aminah Nuraini

Reputation: 19170

You can suppress the warning like this:

@$doc->loadHTML($amazon);

Upvotes: 6

Pascal

Reputation: 2405

This problem is caused by markup that is not valid XHTML.

DOMDocument::loadHTML() emits these warnings when the markup is not well-formed, so one option is to clean up the HTML before loading it.

PHP has an extension that does the job pretty well, called Tidy: php.net/book.tidy

It might be tricky, as you may need to enable it in your php.ini.

Then:

// $html holds the fetched markup (e.g. $amazon from the question).
$tidy_config = array(
    'clean' => true,
    'output-xhtml' => true,
    'show-body-only' => true,
    'wrap' => 0,
);

$tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
$tidy->cleanRepair();
$doc = new DOMDocument();
$doc->loadHTML((string) $tidy);
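
Since Tidy may not be enabled on every server, a guarded variant could look like the sketch below. The helper name tidyIfAvailable and the sample input are assumptions for illustration; when the extension is missing, the markup is simply returned unchanged:

```php
<?php
// Hypothetical helper: run Tidy when the extension is loaded, otherwise
// return the markup unchanged so the caller can still try to parse it.
function tidyIfAvailable(string $html): string
{
    if (!extension_loaded('tidy')) {
        return $html; // Tidy is not enabled in this PHP build.
    }

    $tidy = tidy_parse_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
    if ($tidy === false) {
        return $html; // Parsing failed; fall back to the original markup.
    }

    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

// Made-up input with an unclosed tag.
$clean = tidyIfAvailable('<p>unclosed paragraph');
```

The result can then be passed to $doc->loadHTML($clean) as in the snippet above.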

Upvotes: 6
