Reputation: 33146
I have a large number of HTML files that I need to process with XSLT, using an XML file to select which HTML files to process and what to do with them.
I tried:
This doesn't work, because:
Fine (I thought) ... XHTML is just XML, I just need to put it through HTML Tidy and say:
"output-xml yes ... output-html no ... output-xhtml no"
...but HTML Tidy ignores you if you attempt that, and forces html instead :(. It seems to be hardcoded to only output XML files if the input was XML to begin with.
Any ideas for how to:
NB: this has to work on OS X - it's part of a build process for iOS apps. That shouldn't be a big problem, but it does mean that Windows-only tools aren't available. I'd like to achieve this with standard open-source cross-platform tools (like tidy, libxslt, etc).
Upvotes: 3
Views: 3187
Reputation: 33146
I finally discovered why xsltproc / Saxon were refusing to parse the files if they were passed in with a DOCTYPE html:
The DOCTYPE of the external document alters how they interpret the xmlns (namespace) declaration. Tidy was (correctly) declaring the XHTML namespace via xmlns, so all the unqualified node names inside my XSLT matched nothing. XSLT was just ignoring those nodes, as if they didn't exist. It needed me to map a prefix to the same namespace and use that prefix in my paths.
...strangely, if the DOCTYPE was xml, then they happily ignored the xmlns declaration, or at least allowed me to reference nodes by unqualified name. This fooled me into thinking that they were point-blank ignoring the node-sets inside the XHTML-DOCTYPE'd version.
So, the "solution" is something like this:
Example code:
Your stylesheet goes from this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...to this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
Your select / match / document-import goes from this:
<xsl:copy-of select="document('html-files/file1.htm')/html/body"/>
...to this:
<xsl:copy-of select="document('html-files/file1.htm')/xhtml:html/xhtml:body"/>
NB: just to be clear: if you ignore namespaces, then it seems XSLT will work on files that are unDOCTYPED, even if they have a namespace in them. Don't make the mistake I made of thinking your XSLT is correct just because it appears to be :)
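To see the namespace effect in isolation, here is a minimal sketch in Python (standard library only, not XSLT, so it only approximates what xsltproc or Saxon do). The inline SAMPLE document is a hypothetical stand-in for a Tidy-produced XHTML file. It shows the same rule the stylesheet fix relies on: an unqualified name does not match an element that lives in the XHTML namespace, while a prefix mapped to that namespace does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for an XHTML file produced by Tidy.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>Testing</p></body>"
    "</html>"
)

root = ET.fromstring(SAMPLE)

# Unqualified name: looks for "body" in no namespace, so nothing matches.
unqualified = root.find("body")

# Qualified name: the prefix is mapped to the XHTML namespace,
# playing the same role as xmlns:xhtml in the stylesheet.
qualified = root.find("xhtml:body", {"xhtml": XHTML_NS})

print(unqualified)    # None
print(qualified.tag)  # {http://www.w3.org/1999/xhtml}body
```

The prefix itself is arbitrary; only the namespace URI it maps to has to match the one declared in the document.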
Upvotes: 2
Reputation: 11223
It's been a while, but I remember trying to use HTMLTidy to prep HTML files for XSLT and was disappointed by how easily it gave up while trying to "well form" the HTML. Then I found TagSoup, and was very pleased.
TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
I don't know if you're bound to HTMLTidy, but if not try this: http://home.ccil.org/~cowan/tagsoup/
As an example, here's a bad HTML file:
<body>
<p>Testing
</body>
And here's the tagsoup command and its output:
~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html
src: bad.html
<html><body>
<p>Testing
</p></body></html>
Edit 01
Here is how tagsoup handles DOCTYPEs.
Here's a bad HTML file with a valid DOCTYPE:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<p>Testing
</body>
</html>
Here's how tagsoup handles it:
~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html bad.html
src: bad.html
<html><body>
<p>Testing
</p></body></html>
It isn't until you explicitly pass a DOCTYPE to tagsoup that it attempts to output one:
~ zyoung$ java -jar /usr/local/tagsoup-1.2.jar --html --doctype-public=html bad.html
src: bad.html
<!DOCTYPE PUBLIC "html" "">
<html><body>
<p>Testing
</p></body></html>
I hope this helps,
Zachary
Upvotes: 0
Reputation: 24826
I think the main problem is caused by the XML catalog DOCTYPE declaration. You can test this by removing the external entity reference from the input XHTML and seeing whether the processor then handles it correctly.
I would do as follows:
The main problem is that Saxon and xsltproc do not have an option to disable external entity resolution. The MSXSL.exe command-line utility supports this with its -xe option.
Upvotes: 0
Reputation: 11996
If you run xsltproc --help, among the accepted input flags is a very conspicuous one called --html, which supposedly tells xsltproc that:
--html: the input document is(are) an HTML file(s)
Presumably for this to work you must have valid HTML files to begin with, though. So you might want to tidy them up first.
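If the transform is driven from a build script, the invocation could be wrapped roughly like this. This is only a sketch in Python: the file names (transform.xsl, html-files/file1.htm, out.xml) and both function names are made up for illustration, and it assumes xsltproc is on the PATH. The only xsltproc options it uses are --html and --output.

```python
import shutil
import subprocess


def build_xsltproc_command(stylesheet, html_file, output_file):
    """Build the xsltproc argument list; --html makes it parse the input as HTML."""
    return ["xsltproc", "--html", "--output", output_file, stylesheet, html_file]


def run_xsltproc(stylesheet, html_file, output_file):
    """Run the transform, raising if xsltproc is missing or exits non-zero."""
    if shutil.which("xsltproc") is None:
        raise FileNotFoundError("xsltproc not found on PATH")
    subprocess.run(
        build_xsltproc_command(stylesheet, html_file, output_file),
        check=True,
    )


# Hypothetical file names, for illustration only.
command = build_xsltproc_command("transform.xsl", "html-files/file1.htm", "out.xml")
print(" ".join(command))
# prints: xsltproc --html --output out.xml transform.xsl html-files/file1.htm
```

Calling run_xsltproc with real paths performs the actual transform; building the argument list separately makes it easy to log or inspect the command first.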
Upvotes: 0
Reputation: 1185
XHTML is XML (if it is valid).
To get your XHTML processed as XML, you must not serve it with the "text/html" MIME type. Use application/xhtml+xml instead (keep in mind that IE6 cannot render this and will prompt visitors to download the page instead).
In PHP, you serve it as application/xhtml+xml with the header() function.
I think this should do the trick:
header('Content-Type: application/xhtml+xml');
Does this help?
Upvotes: 0