Reputation: 6506
I'm using the code below to extract the content I want from HTML with DOMDocument:
$subject = 'some html code';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ($result as $node) {
    $domNode = $docSave->importNode($node, true);
    $docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
The problem is that if the HTML in $subject contains a special character such as a space or a newline, it is converted to an HTML entity. The input HTML is far from well-formed, and some special characters also appear inside URLs in tag attributes, for instance:
$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
 4'></a></div>";
will produce:
<div><a href='http://www.site.com/test.php?a=1&b=2,%203,%0A%204'></a></div>
instead of:
<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
 4'></a></div>
What can I do to prevent special characters from being converted to entities, given that I want to keep the invalid HTML as it is?
I tried setting the substituteEntities flag to false, but it made no difference. Maybe I used it incorrectly? Some code examples would be very helpful.
Upvotes: 0
Views: 1013
Reputation: 324640
You can't use a parser and still preserve the bad HTML. A parser has to clean up the HTML in order to parse it.
If you absolutely must keep the bad HTML, use regexes, but be aware that there is an extreme risk of head injury: you will either get brick'd or bang your head against the desk too much.
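For what it's worth, here is a rough sketch of the regex route, just to show that the matched text comes back byte-for-byte with no re-encoding. The helper name extractDivsRaw is made up for illustration, and the pattern assumes the <div> elements are not nested and that no attribute value contains a ">" character. Real-world markup will break it sooner or later.

```php
<?php
// Hypothetical helper (not part of any library): returns each
// non-nested <div>...</div> block exactly as it appears in the input.
function extractDivsRaw(string $html): array
{
    // 'i' = case-insensitive, 's' = let '.' match newlines as well.
    // The lazy '.*?' stops at the first '</div>', so nested <div>
    // elements are NOT handled correctly by this pattern.
    preg_match_all('~<div\b[^>]*>.*?</div>~is', $html, $matches);
    return $matches[0];
}

$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3,\n 4'></a></div>";

// Each block is echoed untouched: the space and the newline inside
// the href are left as they were, because nothing is parsed or serialized.
foreach (extractDivsRaw($subject) as $block) {
    echo $block, "\n";
}
```

Since nothing is parsed, nothing gets cleaned up either, so whatever consumes the result still has to cope with the invalid markup.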
Upvotes: 2