HelpNeeder

Reputation: 6490

Parsing HTML to XML

I'm working with a text/HTML diffing engine that uses XML at its core, but we're feeding it HTML5 data. I wonder how to handle tags that don't need to be closed in HTML5 but must be closed in XML. For example:

<img alt="" height="239" src="http://example.com/image.png" width="272">

Do I need to convert every tag manually (Just like this example)?

Is there a tool that would do this for me and save me the headache of closing every unclosed HTML tag by hand?

For example, xml_parse() reports an error for the following document, because the body is valid HTML but invalid XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [<!ENTITY Aacute "&#193;">]>
<body>
    <div>
        <figure class="table ">
            <figcaption>
                <p class="table_number"></p>
                <p class="table_title" epub:type="title"></p>
            </figcaption>
            <table class="code ">
                <tr>
                    <td width="50">
                        <img alt="" height="239" src="http://example.com/image.png" width="272">
                    </td>
                </tr>
            </table>
        </figure>
    </div>
</body>

Upvotes: 1

Views: 10846

Answers (3)

imhotap

Reputation: 2500

The proper method to parse HTML, including HTML5, and then format it into XML is to use SGML, the superset of HTML and XML. You can use the osx program (part of the OpenSP/OpenJade package) specifically designed for this purpose. Install it via sudo apt-get install opensp on Ubuntu/Debian.

In SGML, you use a DTD file containing markup declarations to tell SGML which start- and end-element tags can be omitted, among other things. You can use my HTML 5.1 DTD at http://sgmljs.net/docs/w3c-html51-dtd.html for this purpose (just copy the DTD code text on that page into a file named html51.dtd, say). The HTML file to parse then needs to reference the .dtd file so its first line should look like

<!DOCTYPE html SYSTEM "html51.dtd">

assuming html51.dtd is in the same directory as the file to parse. In case you wondered, SGML is where the DOCTYPE declaration at the beginning of many HTML documents comes from, though browsers abused it for detecting HTML versions and other things. In any case, your HTML must not contain two or more DOCTYPE declarations. So if it already contains a line such as

<!DOCTYPE html>

you replace that line with the one I wrote above.
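If you have many files to prepare, the DOCTYPE swap described above could be scripted. The following is only a minimal sketch in PHP (the language used elsewhere in this thread): the function name useSgmlDoctype is hypothetical, and it assumes the existing DOCTYPE has no internal subset in square brackets.

```php
<?php
// Hypothetical helper: point the document at the SGML DTD by replacing an
// existing DOCTYPE declaration, or by prepending one if none is present.
// Assumption: the existing DOCTYPE has no internal subset ([...]).
function useSgmlDoctype(string $html, string $dtdFile = 'html51.dtd'): string
{
    $doctype = '<!DOCTYPE html SYSTEM "' . $dtdFile . '">';
    $pattern = '/<!DOCTYPE\b[^>]*>/i';

    if (preg_match($pattern, $html)) {
        // A callback avoids any special meaning of $ or \ in the replacement;
        // the limit of 1 replaces only the first declaration.
        return preg_replace_callback($pattern, function () use ($doctype) {
            return $doctype;
        }, $html, 1);
    }

    return $doctype . "\n" . $html;
}

echo useSgmlDoctype("<!DOCTYPE html>\n<p>Hello"), PHP_EOL;
```

The result can then be written back to the file before it is passed to osx.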

Now you just invoke

osx your-file.html > your-file.xml

(where your-file.html is the file you want to parse and that you've edited to begin with the proper DOCTYPE declaration) and you have a proper XML file your-file.xml, or will see detailed error messages otherwise.

If you want to learn more about my HTML DTD, I gave a talk on it a year ago at the XML Prague conference. The slides and the full text are linked from http://sgmljs.net/blog.html.

Upvotes: 1

Quentin

Reputation: 944149

In general, you can use PHP's built-in DOM handling routines to parse HTML and output XML:

$html = <<< HEREDOC
<!DOCTYPE html>
<body>
    <div>
        <figure class="table ">
            <figcaption>
                <p class="table_number"></p>
                <p class="table_title" epub:type="title"></p>
            </figcaption>
            <table class="code ">
                <tr>
                    <td width="50">
                        <img alt="" height="239" src="http://example.com/image.png" width="272">
                    </td>
                </tr>
            </table>
        </figure>
    </div>
</body>
HEREDOC;

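// libxml's HTML parser may emit warnings for HTML5 tags such as <figure>;
// collect them internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
echo $dom->saveXML(), PHP_EOL;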

Unfortunately, your use of an XML prolog and attempt to extend the HTML 5 Doctype as if it were an XML/SGML Doctype prevents the DOM library from successfully parsing it.
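One possible workaround, shown here only as a sketch, is to strip the XML prolog and the extended DOCTYPE before handing the markup to the DOM library. The function name htmlToXml and both regular expressions are my own assumptions, not part of any library, and custom entities declared in the removed DOCTYPE are lost unless the HTML parser already knows them.

```php
<?php
// Sketch: drop the XML declaration and the extended DOCTYPE, then let the
// HTML parser close void elements such as <img>.
function htmlToXml(string $markup): string
{
    // Remove a leading XML declaration.
    $markup = preg_replace('/^\s*<\?xml\b.*?\?>\s*/s', '', $markup);

    // Remove the first DOCTYPE, with or without an internal subset in [...].
    $markup = preg_replace('/<!DOCTYPE\b[^\[>]*(\[.*?\])?\s*>\s*/is', '', $markup, 1);

    libxml_use_internal_errors(true);   // HTML5 tags may trigger parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML('<!DOCTYPE html>' . $markup);
    libxml_clear_errors();

    // Serialize only the root element, so no prolog or DOCTYPE is emitted.
    return $dom->saveXML($dom->documentElement);
}

$input = <<<'DOC'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [<!ENTITY Aacute "&#193;">]>
<body>
    <div>
        <img alt="" height="239" src="http://example.com/image.png" width="272">
    </div>
</body>
DOC;

echo htmlToXml($input), PHP_EOL;
```

In the output the img element is serialized as a self-closed tag. Note that loadHTML() does not assume UTF-8 unless the markup declares a charset, so non-ASCII input may need extra care.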

Upvotes: 2

ArtisticPhoenix

Reputation: 21681

I would update the old tags with something like this:

$field = preg_replace('/\<img([^>]+)(?<!\/)>/', '<img\1/>', $field);

Using a negative lookbehind we can match all unclosed img tags, capture the "guts" in each one, and then replace it with a closed tag.

  • \<img matches the literal text <img
  • ([^>]+) captures one or more characters that are not >
  • (?<!\/)> is a negative lookbehind: it matches the closing > only if it is not preceded by a /, i.e. it matches > but not />

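So given a tag like this

<img alt="" height="239" src="http://example.com/image.png" width="272">

it will capture the following as \1 (the braces {} are only there to show that the leading space is captured):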

 { alt="" height="239" src="http://example.com/image.png" width="272"}

The replacement then writes <img back, reinserts the "guts" with \1, and ends the tag with /> instead of >.

And our tag is now closed

<img alt="" height="239" src="http://example.com/image.png" width="272"/>

This could be expanded with another capture group and a list of tags like this:

$field = preg_replace('/\<(img|br)([^>]*)(?<!\/)>/', '<\1\2/>', $field);

And now it will match <br> and replace it with <br/> as well as the img tag. All the while ignoring closed tags like this:

<img alt="" height="239" src="http://example.com/image2.png" width="272"/>
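Taking the same idea further, the tag list could cover every HTML void element. The sketch below assumes the list from the HTML Living Standard; VOID_TAGS and closeVoidTags are hypothetical names. It adds a lookahead after the tag name so that col does not also match colgroup. Like the patterns above, it will misfire on attribute values that contain a literal >.

```php
<?php
// Void elements, assuming the list in the HTML Living Standard.
const VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                   'input', 'link', 'meta', 'source', 'track', 'wbr'];

function closeVoidTags(string $html): string
{
    $names = implode('|', VOID_TAGS);
    // The lookahead (?=[\s\/>]) requires the tag name to end here,
    // so <col> matches but <colgroup> does not.
    $pattern = '/<(' . $names . ')(?=[\s\/>])([^>]*)(?<!\/)>/i';

    // \1 is the tag name, \2 the attributes; already closed tags are skipped.
    return preg_replace($pattern, '<\1\2/>', $html);
}

echo closeVoidTags('<colgroup><col span="2"></colgroup><br><hr/>'), PHP_EOL;
// prints: <colgroup><col span="2"/></colgroup><br/><hr/>
```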

So, it's not impossible.

I feel obligated to mention that you should always export a backup of the table before making changes of this scope. That way, if something goes wrong, you have a safety net.

Upvotes: 1
