Reputation: 2719
I'm trying to parse some HTML with PHP, but the output is not what I expect. Here is the relevant code, which can be run on the command line ($ php script.php).
<?php
function images_to_links($text)
{
    $dom = new \DOMDocument('1.0', 'UTF-8');

    // Load the document, hiding and then restoring error setting
    $internalErrors = libxml_use_internal_errors(true);
    $dom->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_use_internal_errors($internalErrors);

    // Extract images from the dom
    $xpath = new DOMXPath($dom);

    // Other processing code removed for this example

    $cleaned_html = $dom->saveHTML();
    return $cleaned_html;
}
$some_text = <<<EOD
<blockquote>asdf</blockquote>
<a href="http://example.com/">click here</a>
<br />
<p><a href="http://example.com/">another link</a></p>
EOD;
print images_to_links($some_text);
Expected output:
<blockquote>asdf</blockquote>
<a href="http://example.com/">click here</a>
<br />
<p><a href="http://example.com/">another link</a></p>
Actual output -- notice how the blockquote has wrapped around the other elements:
<blockquote>asdf<a href="http://example.com/">click here</a><br><p><a href="http://example.com/">another link</a></p></blockquote>
Is there an error in my code, or is this a bug in DOMDocument?
Upvotes: 0
Views: 161
Reputation: 6253
libxml requires a single root node, so it treats the first element it finds as the root (ignoring its closing tag) and nests everything that follows inside it.
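One common workaround, if you want to keep LIBXML_HTML_NOIMPLIED, is to give the parser a single root yourself and strip it again when serializing. A minimal sketch, assuming a temporary <div> wrapper is acceptable for your markup; the function name fragment_with_wrapper is made up for illustration, and encoding handling is left out to keep it short:

```php
<?php
// Hypothetical helper (not part of the question's code): wraps the fragment
// in a temporary <div> so libxml sees exactly one root element.
function fragment_with_wrapper(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings internally, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper <div> is now the document element; serialize only its
    // children so the wrapper itself does not appear in the result.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$some_text = <<<EOD
<blockquote>asdf</blockquote>
<a href="http://example.com/">click here</a>
<br />
<p><a href="http://example.com/">another link</a></p>
EOD;

print fragment_with_wrapper($some_text);
```

With this, the blockquote should close where you wrote it instead of swallowing its siblings. Note that saveHTML() normalizes the markup, so <br /> will come back as <br>.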
Upvotes: 2
Reputation: 35337
I wouldn't consider it a bug. My assumption is that DOMDocument, like most DOM utilities, expects everything to be nested under a single tag like <html>.
By using the LIBXML_HTML_NOIMPLIED flag, you're telling DOMDocument to skip the step it usually takes with partial HTML, which is wrapping it in <html><body> tags.
http://php.net/manual/en/libxml.constants.php
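If you don't need the flag, one option is to drop LIBXML_HTML_NOIMPLIED, let the parser add the implied <html><body> wrapper, and then serialize only the children of <body>. A rough sketch along those lines; the function name fragment_via_body is made up for illustration, and the encoding conversion from the question is omitted for brevity:

```php
<?php
// Hypothetical helper (not from the question): parses the fragment with the
// implied <html><body> wrapper and returns only what ended up inside <body>.
function fragment_via_body(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings internally, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    // No LIBXML_HTML_NOIMPLIED here, so <html> and <body> are added for us
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        // Nothing parseable ended up in a body element
        return '';
    }

    // Serialize each child of <body> individually to leave the wrapper out
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

print fragment_via_body(
    "<blockquote>asdf</blockquote>\n<a href=\"http://example.com/\">click here</a>"
);
```

The trade-off is an extra lookup for <body> on every call, but the fragment's top-level elements stay siblings rather than being nested under the first one.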
Upvotes: 1