Carl Onager

Reputation: 4122

Closing tags when extracting HTML from XML

I am transforming a mixed HTML and XML document with an XSLT stylesheet, extracting only the HTML elements.

Source file:

<?xml version="1.0" encoding="utf-8" ?>
<html >
  <head>
    <title>Simplified Example Form</title>
  </head>
  <body>
    <TLA:document xmlns:TLA="http://www.TLA.com">
      <TLA:contexts>
        <TLA:context id="id_1" value=""></TLA:context>
      </TLA:contexts>
      <table id="table_logo" style="display:inline">
        <tr>
          <td height="20" align="middle">Big Title Goes Here</td>
        </tr>
        <tr>
          <td align="center">
            <img src="logo.jpg" border="0"></img>
          </td>
        </tr>
      </table>
      <TLA:page>
        <TLA:question id="q_id_1">
          <table id="table_id_1">
            <tr>
              <td>Label text goes here</td>
              <td>
                <input id="input_id_1" type="text"></input>
              </td>
            </tr>
          </table>
        </TLA:question>
      </TLA:page>
      <!-- Repeat many times -->
    </TLA:document>
  </body>
</html>

Stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:TLA="http://www.TLA.com" exclude-result-prefixes="TLA">
  <xsl:output method="html" indent="yes" version="4.0" />
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the 
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements -->
  <xsl:template match="TLA:*">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

Output:

<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Simplified Example Form</title>
  </head>
  <body>
    <table id="table_logo" style="display:inline">
      <tr>
        <td height="20" align="middle">Big Title Goes Here</td>
      </tr>
      <tr>
        <td align="center"><img src="logo.jpg" border="0"></td>
      </tr>
    </table>
    <table id="table_id_1">
      <tr>
        <td>Label text goes here</td>
        <td><input id="input_id_1" type="text"></td>
      </tr>
    </table>
  </body>
</html>

However, there's a problem: the meta, img, and input elements are not being closed. I've set xsl:output to method html and version 4.0, so as far as I know the result should be correct HTML.

I'm guessing that a subtle change is needed in the first xsl:template/xsl:copy instruction, but my XSLT skills are very limited.

What change needs to be made to get the tags to close correctly?

P.S. I'm not sure whether different tools/parsers behave differently, but I'm using Visual Studio 2012 to debug the stylesheet so that I can see the immediate effect of any change.

Upvotes: 1

Views: 1285

Answers (2)

Martin Honnen

Reputation: 167516

I am afraid you don't understand the syntax rules of SGML-based HTML, which is what HTML 4 and 4.01 are: the correct markup for an empty element is <input>, not <input></input>, <input/>, or <input />.

So, having requested the html output method and version, you get correct HTML syntax when the result tree of your XSLT transformation is serialized.
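A minimal sketch of that serialization behaviour, assuming any XSLT 1.0 processor that honours the html output method. The element and attribute values are borrowed from the question; the stylesheet itself is only an illustration and ignores its input document.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" version="4.0" indent="yes"/>

  <!-- Build a tiny result tree regardless of the input document -->
  <xsl:template match="/">
    <html>
      <body>
        <!-- Both spellings create the same empty img element node -->
        <img src="logo.jpg" border="0"></img>
        <img src="logo.jpg" border="0"/>
        <input id="input_id_1" type="text"/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

How the element was written in the stylesheet or source makes no difference: the result tree only records an empty element, and the html serializer should write each of these as a start tag alone, for example <img src="logo.jpg" border="0">.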

Check, for instance, http://validator.w3.org/check?uri=http%3A%2F%2Fhome.arcor.de%2Fmartin.honnen%2Fxslt%2Ftest2013040901Result.html&charset=%28detect+automatically%29&doctype=Inline&group=0: there are no errors or warnings about elements not being closed properly in there.

However, with http://validator.w3.org/check?uri=http%3A%2F%2Fhome.arcor.de%2Fmartin.honnen%2Fxslt%2Ftest2013040902Result.html&charset=%28detect+automatically%29&doctype=Inline&group=0 you get warnings that elements are incorrectly closed.

So the html output method does the right thing; see also http://www.w3.org/TR/xslt#section-HTML-Output-Method, which says:

The html output method should not output an end-tag for empty elements. For HTML 4.0, the empty elements are area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta and param. For example, an element written as <br/> or <br></br> in the stylesheet should be output as <br>.
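To run a validator over your own output, it may help to have the serializer emit a document type declaration so the checker knows which HTML version to apply. The sketch below uses the standard doctype-public and doctype-system attributes of xsl:output. The choice of HTML 4.01 Transitional is an assumption on my part, made because the sample markup uses deprecated attributes such as height on td and border on img, and the identity template merely stands in for the real templates.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: HTML 4.01 Transitional, because the sample markup uses
       deprecated attributes such as height on td and border on img -->
  <xsl:output method="html" version="4.0" indent="yes"
      doctype-public="-//W3C//DTD HTML 4.01 Transitional//EN"
      doctype-system="http://www.w3.org/TR/html4/loose.dtd"/>

  <!-- Plain identity template, standing in for the real templates -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The serialized document should then begin with a document type declaration naming that DTD, placed immediately before the html start tag. This does not guarantee that the page validates; it only tells the validator which rules to check against.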

Upvotes: 2

Eero Helenius

Reputation: 2585

The <meta>, <img> and <input> elements don't need to be closed; the output is still valid HTML.

If you want to have them closed, you could use xml as the output method (with XSLT 2.0 you could use xhtml too, as far as I know) and add the <meta> tag yourself if you need it. For example:

Stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:TLA="http://www.TLA.com" exclude-result-prefixes="TLA">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*" />

  <xsl:template match="@*|node()" priority="-2">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="head">
    <xsl:copy>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- This element-only identity template prevents the 
       TLA namespace declaration from being copied to the output -->
  <xsl:template match="*">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Pass processing on to child elements of TLA elements -->
  <xsl:template match="TLA:*">
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>

Output:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Simplified Example Form</title>
  </head>
  <body>
    <table id="table_logo" style="display:inline">
      <tr>
        <td height="20" align="middle">Big Title Goes Here</td>
      </tr>
      <tr>
        <td align="center">
          <img src="logo.jpg" border="0"/>
        </td>
      </tr>
    </table>
    <table id="table_id_1">
      <tr>
        <td>Label text goes here</td>
        <td>
          <input id="input_id_1" type="text"/>
        </td>
      </tr>
    </table>
  </body>
</html>

Upvotes: 1
