fabio

Reputation: 2339

Get element by id using DomDocument on html page that is broken

I am trying to extract only one div element having id=MainText from this page. However, when I load the contents of this page into the DOM object I get several errors:

Tag g:plusone invalid... 
Unexpected end tag...
htmlParseEntityRef: no name ...
htmlParseEntityRef: expecting ';' ...

Is it possible to ignore everything else in the document and go straight to the part I want, fetching only the div element with that particular id?

Or else, is there an alternative to the DOMDocument class for achieving the same thing? I'm not very good at writing regular expressions.

Upvotes: 1

Views: 865

Answers (2)

Eineki

Reputation: 14909

The W3C validator, on a quick run, reports a lot of errors for that page. Try cleaning up the HTML with Tidy before feeding it to DOMDocument:

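# Assuming $html holds your HTML source (retrieve it however you prefer)
$config = array('output-xhtml' => true); // example Tidy options (an assumption); adjust as needed
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Load the repaired markup instead of the original broken source
$document = new DOMDocument();
$document->loadHTML((string)$tidy);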

Upvotes: 2

dynamic

Reputation: 48101

Never use RegEx with HTML.

Stick with DOMDocument, and suppress the parse errors if they don't cause further problems.
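
One way to suppress the errors and still reach the div is sketched below. It is only a sketch: the $html sample string is a made-up stand-in for the real page, and using DOMXPath to find the element is my own choice here rather than anything required by DOMDocument. libxml_use_internal_errors(true) makes libxml collect the parse errors internally instead of raising PHP warnings.

```php
<?php
// Hypothetical broken markup standing in for the real page.
$html = '<html><body><g:plusone></g:plusone>'
      . '<p>Fish & chips</p></span>'
      . '<div id="MainText">Hello <b>world</div>'
      . '</body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

// Discard the collected errors and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only the div with id="MainText".
$xpath = new DOMXPath($document);
$node = $xpath->query('//div[@id="MainText"]')->item(0);

// Serialize just that element, or null if it was not found.
$mainText = $node !== null ? $document->saveHTML($node) : null;
```

If you want to see what the parser complained about, call libxml_get_errors() before libxml_clear_errors().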

Upvotes: 0
