Reputation: 2339
I am trying to extract only one div element having id=MainText
from this page. However, when I load the contents of this page into the DOM object I get several errors:
Tag g:plusone invalid... Unexpected end tag... htmlParseEntityRef: no name ... htmlParseEntityRef: expecting ';' ...
So I was wondering whether it is possible to ignore everything else in the document and go straight to the part I want, getting only the div element with that particular id.
Otherwise, is there an alternative to the DOMDocument class for achieving the same thing? I'm not very good at writing regular expressions.
Upvotes: 1
Views: 865
Reputation: 14909
The W3C validator, on a quick run, reports a lot of errors for that page. Try cleaning up the HTML with Tidy like this before feeding it to DOMDocument:
# Assuming that $html is your HTML source (retrieve it as you prefer)
# Example Tidy options only; adjust them to your needs
$config = array('output-xhtml' => true, 'wrap' => 0);
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$document = new DOMDocument();
$document->loadHTML((string)$tidy);
Upvotes: 2
Reputation: 48101
Never use RegEx with HTML.
Stick with DOMDocument and suppress the parse errors, as long as they don't cause further problems.
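As a rough sketch of what that could look like: libxml's internal error handling can be switched on so the malformed markup does not raise warnings, and the div can then be selected by id with XPath. The `$html` string below is a made-up stand-in for the real page, and `$mainText` is just an illustrative variable name.

```php
<?php
// Hypothetical sample markup standing in for the real page source.
$html = '<html><body><g:plusone></g:plusone><p>A & B</p>'
      . '<div id="MainText"><p>Wanted content</p></div></span></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the div by its id attribute.
$xpath = new DOMXPath($document);
$node = $xpath->query('//div[@id="MainText"]')->item(0);

$mainText = null;
if ($node !== null) {
    // Passing a node to saveHTML() serializes only that element and its children.
    $mainText = $document->saveHTML($node);
    echo $mainText, PHP_EOL;
}
```

If the errors do turn out to matter, `libxml_get_errors()` can be called before `libxml_clear_errors()` to inspect them.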
Upvotes: 0