Reputation: 6490
I'm working with a text/HTML diffing engine that uses XML at its core, but we're feeding it HTML5 data. I wonder how to handle tags that don't need to be closed in HTML5 but must be closed in XML. For example:
<img alt="" height="239" src="http://example.com/image.png" width="272">
Do I need to convert every such tag manually (like the one in this example)?
Is there a tool that would do this for me and save me the headache of escaping all self-closing HTML tags?
For example, xml_parse() reports an error on the following document, because body contains valid HTML that is not well-formed XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html [<!ENTITY Aacute "Á">]>
<body>
<div>
<figure class="table ">
<figcaption>
<p class="table_number"></p>
<p class="table_title" epub:type="title"></p>
</figcaption>
<table class="code ">
<tr>
<td width="50">
<img alt="" height="239" src="http://example.com/image.png" width="272">
</td>
</tr>
</table>
</figure>
</div>
</body>
Upvotes: 1
Views: 10846
Reputation: 2500
The proper method to parse HTML, including HTML5, and then format it into XML is to use SGML, the superset of HTML and XML. You can use the osx program (part of the OpenSP/OpenJade package), which is designed specifically for this purpose. Install it via sudo apt-get install opensp on Ubuntu/Debian.
In SGML, you use a DTD file containing markup declarations to tell SGML which start- and end-element tags can be omitted, among other things. You can use my HTML 5.1 DTD at http://sgmljs.net/docs/w3c-html51-dtd.html for this purpose (just copy the DTD code text on that page into a file named, say, html51.dtd). The HTML file to parse then needs to reference the .dtd file, so its first line should look like
<!DOCTYPE html SYSTEM "html51.dtd">
assuming html51.dtd is in the same directory as the file to parse. In case you wondered, SGML is where the DOCTYPE declaration at the beginning of many HTML documents comes from, though browsers abused it for detecting HTML versions and other things. In any case, your HTML must not contain two or more DOCTYPE declarations. So if it already contains a line such as
<!DOCTYPE html>
you replace that line with the one I wrote above.
Now you just invoke
osx your-file.html > your-file.xml
(where your-file.html is the file you want to parse, edited to begin with the proper DOCTYPE declaration) and you get a proper XML file your-file.xml, or detailed error messages otherwise.
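If you are preparing the input from PHP anyway, the DOCTYPE step described above could be sketched roughly as follows. The helper name withSgmlDoctype is made up for this illustration, and the sketch assumes the existing declaration is a plain one such as <!DOCTYPE html> (a declaration with an internal subset in square brackets would need a more careful pattern).

```php
<?php
// Hypothetical helper (not part of OpenSP): make sure the document starts
// with exactly one DOCTYPE declaration that points at the local DTD file.
function withSgmlDoctype(string $html, string $dtdFile = 'html51.dtd'): string
{
    $doctype = '<!DOCTYPE html SYSTEM "' . $dtdFile . '">';

    // Remove any existing plain DOCTYPE declaration so that only one remains.
    $body = preg_replace('/<!DOCTYPE[^>]*>\s*/i', '', $html);

    // Put the SYSTEM declaration on the first line, followed by the document.
    return $doctype . "\n" . ltrim($body);
}
```

The result can be written to your-file.html and then passed to osx exactly as shown above.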
If you want to learn more about my HTML DTD, I gave a talk on it a year ago at the XML Prague conference. The slides and the full text are linked from http://sgmljs.net/blog.html.
Upvotes: 1
Reputation: 944149
In general, you can use PHP's built-in DOM handling routines to parse HTML and output XML:
$html = <<< HEREDOC
<!DOCTYPE html>
<body>
<div>
<figure class="table ">
<figcaption>
<p class="table_number"></p>
<p class="table_title" epub:type="title"></p>
</figcaption>
<table class="code ">
<tr>
<td width="50">
<img alt="" height="239" src="http://example.com/image.png" width="272">
</td>
</tr>
</table>
</figure>
</div>
</body>
HEREDOC;
$dom = new DOMDocument;
$dom->loadHTML($html);
echo $dom->saveXML($dom), PHP_EOL;
Unfortunately, your use of an XML prolog and attempt to extend the HTML 5 Doctype as if it were an XML/SGML Doctype prevents the DOM library from successfully parsing it.
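One way around that, as a minimal sketch: strip the XML prolog and reduce the doctype to a plain <!DOCTYPE html> before handing the string to loadHTML(). The helper name htmlToXml and both regular expressions are assumptions of this sketch, not part of the DOM API, and dropping the internal subset also drops the custom entity declaration it contained.

```php
<?php
// Hypothetical helper: normalise the input, parse it as HTML, serialise as XML.
function htmlToXml(string $html): string
{
    // Drop a leading XML declaration, if there is one.
    $html = preg_replace('/^\s*<\?xml[^>]*\?>\s*/', '', $html);

    // Replace the doctype, including an optional [...] internal subset,
    // with the plain HTML5 doctype.
    $html = preg_replace(
        '/<!DOCTYPE\s+html[^>\[]*(\[.*?\])?\s*>/is',
        '<!DOCTYPE html>',
        $html,
        1
    );

    // The parser may emit warnings for elements it does not know
    // (figure, figcaption, ...), so collect them instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialise the root element; empty elements come out self-closed.
    return $dom->saveXML($dom->documentElement);
}

// Usage: an unclosed img and br inside a document with prolog and custom doctype.
$input = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
    . '<!DOCTYPE html [<!ENTITY Aacute "A">]>' . "\n"
    . '<body><p>a<br><img alt="" src="x.png"></p></body>';
echo htmlToXml($input), PHP_EOL;
```

Note that a prefixed attribute such as epub:type is copied through as-is, so the output may still need a namespace declaration for that prefix before a namespace-aware XML parser accepts it. loadHTML() also guesses the input encoding unless the document declares one, which matters for non-ASCII content.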
Upvotes: 2
Reputation: 21681
I would update the old tags with something like this,
$field = preg_replace('/\<img([^>]+)(?<!\/)>/', '<img\1/>', $field);
Using a negative lookbehind we can match all unclosed img tags, capture the "guts" in each one, and then replace it with a closed tag.
\<img : literal match
([^>]+) : capture anything that is not a >
(?<!\/)> : negative lookbehind, match the ending > only if it is not preceded by a /, i.e. it matches > but not />
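So given a tag like this
<img alt="" height="239" src="http://example.com/image.png" width="272">
it will capture the following as \1 (the {} are only there to show that the leading space is captured):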
{ alt="" height="239" src="http://example.com/image.png" width="272"}
Then we simply rebuild the tag: the <img, then the "guts" put back in with \1, then a /> in place of the >. Our tag is now closed:
<img alt="" height="239" src="http://example.com/image.png" width="272"/>
This could be expanded with another capture group and a list of tags like this:
$field = preg_replace('/\<(img|br)([^>]*)(?<!\/)>/', '<\1\2/>', $field);
Now it will match <br> and replace it with <br/>, as well as the img tag, all the while ignoring already closed tags like this:
<img alt="" height="239" src="http://example.com/image2.png" width="272"/>
So, it's not impossible.
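Taking the two-tag pattern a step further, here is a rough sketch that covers the whole set of HTML void elements. The function name closeVoidElements is made up for this example, and the element list is my reading of the current HTML spec, so adjust it to the tags your data actually uses. Like any regex approach it is not a real parser: a > inside an attribute value, or markup inside comments and script blocks, will trip it up.

```php
<?php
// Hypothetical helper: self-close every unclosed void element in a string.
function closeVoidElements(string $html): string
{
    // Void elements (assumed list; older documents may also use e.g. param).
    $void = 'area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr';

    // \b keeps "col" from matching inside "colgroup";
    // (?<!\/) skips tags that already end in "/>".
    $pattern = '/<(' . $void . ')\b([^>]*)(?<!\/)>/i';

    // Rebuild the tag: name, captured attributes, then the closing "/>".
    return preg_replace($pattern, '<$1$2/>', $html);
}

// Usage: unclosed tags get closed, already closed ones are left alone.
echo closeVoidElements('<p>a<br>b<img alt="" src="a.png"><hr /></p>'), PHP_EOL;
```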
I feel obligated to mention that you should always export a backup of the table before making changes of this scope. That way, if something goes wrong, you have a safety net.
Upvotes: 1