Reputation: 145
I'm trying to use XSLT to transform XML into a plaintext file for loading into a database. One of the elements I need, however, might contain HTML formatted text which I need to preserve, and newlines and whitespace which I don't. I also don't want the XML namespace.
The file is large and more complicated, but the problem should be covered by the following example.
XML:
<outer xmlns="urn:site-org:v3/m2">
  <inner>
    <text>
      <p>This is text with markup</p>
      <p>This is text with <i>more</i> markup</p>
    </text>
  </inner>
  <inner>
    <text>
      Need text with no markup also
    </text>
  </inner>
</outer>
Desired output:
<p>This is text with markup</p><p>This is text with <i>more</i> markup</p>
Need text with no markup also
With an output format of text, normalize-space() cleans up all the newlines and whitespace, but also removes the tags.
I've tried using xml output and xsl:copy-of, but this leaves the line breaks and the namespace, and character-encodes some of my other output (& -> &amp;), which is undesirable.
Thanks in advance for any ideas!
Upvotes: 0
Views: 576
Reputation: 101738
The key to removing the whitespace without removing the elements is to make proper use of templates and only remove whitespace from text nodes, not from entire elements.
I'm not 100% clear on your requirements, but this should at least come very close:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:m2="urn:site-org:v3/m2">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes" />

  <!-- Remove whitespace-only text nodes between elements -->
  <xsl:strip-space elements="*" />

  <xsl:template match="m2:text">
    <xsl:apply-templates />
    <!-- Newline after each text element -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Copy elements beneath text elements, without their namespace -->
  <xsl:template match="m2:text//*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>

  <!-- Copy attributes beneath text elements -->
  <xsl:template match="m2:text//@*">
    <xsl:copy />
  </xsl:template>

  <!-- Text nodes in HTML content (the text element contains markup):
       normalize space but escape entities -->
  <xsl:template match="m2:text[.//*]//text()" priority="5">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>

  <!-- Text nodes in plain content (no markup in the text element):
       normalize space and don't escape entities -->
  <xsl:template match="m2:text//text()">
    <xsl:value-of select="normalize-space()" disable-output-escaping="yes"/>
  </xsl:template>
</xsl:stylesheet>
When run on the following input:
<outer xmlns="urn:site-org:v3/m2">
  <inner>
    <text>
      <p class="snazzy">This is text with markup and &amp;&amp;&amp; ampersands</p>
      <p>This is text with <i>more</i> markup</p>
    </text>
  </inner>
  <inner>
    <text>
      Need text with no markup also and some &amp;&amp;&amp; ampersands
    </text>
  </inner>
</outer>
The result is:
<p class="snazzy">This is text with markup and &amp;&amp;&amp; ampersands</p><p>This is text with<i>more</i>markup</p>
Need text with no markup also and some &&& ampersands
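Note that the spaces around <i>more</i> are lost in the result above, because normalize-space() trims each text node on its own. If that matters, one possible variant of the priority="5" template re-emits a single space where a text node begins or ends with whitespace next to a sibling node. This is only a sketch: it assumes the m2 prefix and the xsl:strip-space declaration from the stylesheet above, and I haven't run it against your real data.

```xml
<!-- Hypothetical replacement for the priority="5" template in the stylesheet above -->
<xsl:template match="m2:text[.//*]//text()" priority="5">
  <!-- Leading whitespace after a sibling node (e.g. after </i>): keep one space -->
  <xsl:if test="preceding-sibling::node() and normalize-space(substring(., 1, 1)) = ''">
    <xsl:text> </xsl:text>
  </xsl:if>
  <xsl:value-of select="normalize-space()"/>
  <!-- Trailing whitespace before a sibling node (e.g. before <i>): keep one space -->
  <xsl:if test="following-sibling::node() and normalize-space(substring(., string-length(.), 1)) = ''">
    <xsl:text> </xsl:text>
  </xsl:if>
</xsl:template>
```

With this in place, the second paragraph should come out as <p>This is text with <i>more</i> markup</p>. One limitation: a whitespace-only text node between two inline elements (such as <i>a</i> <b>b</b>) is still removed by xsl:strip-space before any template sees it.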
Upvotes: 3