user2123203


XSLT to normalise whitespace but leave inner HTML

I'm trying to use XSLT to transform XML into a plain-text file for loading into a database. One of the elements I need, however, might contain HTML-formatted text, which I need to preserve, along with newlines and extra whitespace, which I don't. I also don't want the XML namespace in the output.

The real file is larger and more complicated, but the problem should be covered by the following example.

XML:

<outer xmlns="urn:site-org:v3/m2" >
  <inner>
    <text>
      <p>This is text with markup</p>
      <p>This is text with <i>more</i> markup</p>
    </text>
  </inner>
  <inner>
    <text>
      Need text with no markup also
    </text>
  </inner>
</outer>

Desired output:

<p>This is text with markup</p><p>This is text with <i>more</i> markup</p>
Need text with no markup also

With an output format of text, normalize-space() cleans up all the newlines and whitespace, but also removes the tags.

I've tried using XML output with xsl:copy-of, but this keeps the line breaks and the namespace, and it escapes characters in some of my other output (& becomes &amp;), which is undesirable.

Thanks in advance for any ideas!

Upvotes: 0


Answers (1)

JLRishe


The key to removing the whitespace without removing the elements is to make proper use of templates and only remove whitespace from text nodes, not from entire elements.

I'm not 100% clear on your requirements, but this should at least come very close:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:m2="urn:site-org:v3/m2">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes" />
  <!-- Remove any whitespace between elements -->
  <xsl:strip-space elements="*" />

  <xsl:template match="m2:text">
    <xsl:apply-templates />
    <!-- Newline -->
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

  <!-- Copy elements beneath text elements, without their namespace -->
  <xsl:template match="m2:text//*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Copy attributes beneath text elements -->
  <xsl:template match="m2:text//@*">
    <xsl:copy />
  </xsl:template>

  <!-- Text nodes in HTML content - normalize space but escape entities -->
  <xsl:template match="m2:text[.//*]//text()" priority="5">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <!-- Text nodes in plain (markup-free) content - normalize space and don't escape entities -->
  <xsl:template match="m2:text//text()">
    <xsl:value-of select="normalize-space()" disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>

When run on the following input:

<outer xmlns="urn:site-org:v3/m2" >
  <inner>
    <text>
      <p class="snazzy">This is text with markup and &amp;&amp;&amp; ampersands</p>
      <p>This is text with <i>more</i> markup</p>
    </text>
  </inner>
  <inner>
    <text>
      Need text with no markup also and some &amp;&amp;&amp; ampersands 
    </text>
  </inner>
</outer>

The result is:

<p class="snazzy">This is text with markup and &amp;&amp;&amp; ampersands</p><p>This is text with<i>more</i>markup</p>
Need text with no markup also and some &&& ampersands
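
Note that in the result above the spaces on either side of <i>more</i> are gone, because normalize-space() trims each text node separately. If those spaces matter, one possible variant (a sketch only, not tested against your real data) is to put back a single space wherever a text node started or ended with whitespace next to a sibling node. Everything except the priority="5" template is carried over unchanged from the stylesheet above:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:m2="urn:site-org:v3/m2">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes" />
  <xsl:strip-space elements="*" />

  <xsl:template match="m2:text">
    <xsl:apply-templates />
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

  <xsl:template match="m2:text//*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <xsl:template match="m2:text//@*">
    <xsl:copy />
  </xsl:template>

  <!-- Text nodes in HTML content: normalize, but keep one boundary space
       where the original text touched a sibling node with whitespace -->
  <xsl:template match="m2:text[.//*]//text()" priority="5">
    <!-- Leading space: first character is whitespace and something precedes us -->
    <xsl:if test="preceding-sibling::node() and normalize-space(substring(., 1, 1)) = ''">
      <xsl:text> </xsl:text>
    </xsl:if>
    <xsl:value-of select="normalize-space()"/>
    <!-- Trailing space: last character is whitespace and something follows us -->
    <xsl:if test="following-sibling::node() and normalize-space(substring(., string-length(.))) = ''">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Text nodes in plain content: unchanged from the original stylesheet -->
  <xsl:template match="m2:text//text()">
    <xsl:value-of select="normalize-space()" disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>
```

With the same input, the second paragraph should then come out as <p>This is text with <i>more</i> markup</p>. One limitation to be aware of: because of xsl:strip-space, a text node that consists only of whitespace between two inline elements (for example <b>a</b> <i>b</i>) is still dropped, so that particular space would be lost. Also keep in mind that disable-output-escaping is an optional feature, so not every XSLT processor or output pipeline honours it.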

Upvotes: 3
