R891

Reputation: 2719

Is this a bug in PHP's DOMDocument Library?

I'm trying to parse some HTML with PHP, but the output is not what I expect. Here is the relevant code, which can be run from the command line ($ php script.php).

<?php
function images_to_links($text)
{
    $dom = new \DOMDocument('1.0', 'UTF-8');

    // Load the document, hiding and then restoring error setting
    $internalErrors = libxml_use_internal_errors(true);
    $dom->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_use_internal_errors($internalErrors);

    // Extract images from the dom
    $xpath = new DOMXPath($dom);

    // Other processing code removed for this example 

    $cleaned_html = $dom->saveHTML();
    return $cleaned_html;
}

$some_text = <<<EOD
<blockquote>asdf</blockquote>
<a href="http://example.com/">click here</a>
<br />
<p><a href="http://example.com/">another link</a></p>
EOD;

print images_to_links($some_text);

Expected output:

<blockquote>asdf</blockquote>
<a href="http://example.com/">click here</a>
<br />
<p><a href="http://example.com/">another link</a></p>

Actual output -- notice how the blockquote has wrapped around the other elements:

<blockquote>asdf<a href="http://example.com/">click here</a><br><p><a href="http://example.com/">another link</a></p></blockquote>

Is there an error in my code, or is this a bug in DOMDocument?

Upvotes: 0

Views: 161

Answers (2)

javier_domenech

Reputation: 6253

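libxml requires a single root node, so when no implied <html><body> wrapper is added it interprets the first element it finds as the root node (ignoring its closing tag) and nests everything that follows inside it.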
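
A common workaround, sketched here under assumptions rather than taken from the question's code, is to supply the single root yourself: wrap the fragment in a container element, load it with the same flags, and then serialize only the container's children. The function name wrap_and_save_fragment and the choice of <div> as the wrapper are hypothetical, and the sketch assumes ASCII input, so it leaves out the encoding conversion used in the question.

```php
<?php
// Hypothetical helper: gives libxml exactly one root element (a <div>)
// so the first real element is no longer promoted to root.
function wrap_and_save_fragment(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper <div> is the document element; emit only its children.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$fragment = <<<EOD
<blockquote>asdf</blockquote>
<a href="http://example.com/">click here</a>
<br />
<p><a href="http://example.com/">another link</a></p>
EOD;

echo wrap_and_save_fragment($fragment), "\n";
```

With this wrapper the blockquote should close before the first link instead of swallowing it. Note that saveHTML() still writes <br /> as <br>, as in the question's actual output.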

Upvotes: 2

Devon Bessemer

Reputation: 35337

I wouldn't consider it a bug. My assumption is that DOMDocument, like most DOM utilities, expects everything to be nested under a single root element such as <html>.

By using the LIBXML_HTML_NOIMPLIED flag, you're telling DOMDocument to skip the step it usually takes with partial HTML, which is wrapping it in <html><body> tags.

http://php.net/manual/en/libxml.constants.php
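
As a rough sketch of the alternative this implies, you could drop LIBXML_HTML_NOIMPLIED, let the parser add its implied <html><body> wrapper, and then serialize only the children of <body>. The function name save_body_children is hypothetical, and the example assumes ASCII input, so it omits the encoding conversion from the question.

```php
<?php
// Hypothetical helper: lets libxml add <html><body>, then returns only
// the serialized children of <body>.
function save_body_children(string $html): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');

    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    // LIBXML_HTML_NOIMPLIED is deliberately left out here.
    $dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$fragment = "<blockquote>asdf</blockquote>\n"
          . "<a href=\"http://example.com/\">click here</a>";

echo save_body_children($fragment), "\n";
```

Because <body> acts as the single root, the blockquote and the link should stay siblings rather than being nested.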

Upvotes: 1
