ilitirit

Reputation: 16352

Using XSL to replace child nodes with sequential ones?

The file we are receiving is being erroneously generated like this:

<html>
    <body>
        <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
            <p>Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
                <p>It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.</p>
            </p>
        </p>
    </body>
</html>

The <p> elements are being embedded into the previous <p> nodes. It should look like this instead:

<html>
    <body>
        <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</p>
        <p>Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.</p>
        <p>It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.</p>
    </body>
</html>

We don't have any control over the application that is sending us the document. How can we transform this document using XSL so that the nested <p> elements (and their contents) are rendered as siblings instead?

Upvotes: 1

Views: 92

Answers (2)

C. M. Sperberg-McQueen

Reputation: 25034

If the only element being malgenerated in this way is p, you'll want to write a template for p that first calls apply-templates for all attributes and non-p children and then applies templates to the embedded p elements. In XSLT 2.0 syntax:

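<xsl:template match="p">
  <!-- Attributes and non-p children stay inside this p. -->
  <p><xsl:apply-templates select="@*, node() except p"/></p>
  <!-- Nested p children are emitted afterwards, as siblings. -->
  <xsl:apply-templates select="p"/>
</xsl:template>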

The rest of the stylesheet will need to perform the identity transform.
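
For completeness, here is a minimal sketch of a whole stylesheet along these lines. It assumes an XSLT 2.0 processor and that the p elements are in no namespace, as in the sample input. The identity template is the usual one, written out for illustration; the select also includes @* so that any attributes on p are carried over.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="no"/>

  <!-- Identity transform: by default, copy every node and attribute unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Overrides the identity template for p (higher default priority).
       Attributes and non-p children are kept inside the p;
       nested p children are processed after it, so they become siblings. -->
  <xsl:template match="p">
    <p>
      <xsl:apply-templates select="@*, node() except p"/>
    </p>
    <xsl:apply-templates select="p"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the nested p elements are handled by the same template, the flattening recurses to any depth. Whitespace text nodes that surrounded a nested p in the input remain inside the parent p, so the output spacing may differ slightly from the hand-written target.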

If other elements also self-nest inappropriately in the input, you'll need to handle them similarly.

If you're using XSLT 1.0 instead of 2.0, you'll need to find some other way to distinguish things that belong inside the p from things that should occur afterwards, since node() except p is not legal in an XSLT 1.0 select value. I'd use modes, myself:

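<xsl:template match="p">
  <p><xsl:apply-templates select="@* | node()" mode="para-contents"/></p>
  <xsl:apply-templates select="p"/>
</xsl:template>

<!-- In para-contents mode, hand attributes and nodes back to the default mode... -->
<xsl:template match="@* | node()" mode="para-contents">
  <xsl:apply-templates select="."/>
</xsl:template>
<!-- ...except nested p elements, which are skipped here and emitted afterwards as siblings. -->
<xsl:template match="p" mode="para-contents"/>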

Or (as Ian Roberts has suggested in a comment) just replace node() except p with node()[not(self::p)].
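
Spelled out as a complete XSLT 1.0 stylesheet, that predicate-based variant might look like the sketch below. As before, it assumes the p elements are in no namespace, and the identity template is added here for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="no"/>

  <!-- Identity transform: by default, copy every node and attribute unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- node()[not(self::p)] selects every child that is not a p element,
       which is the XPath 1.0 equivalent of "node() except p".
       The union is processed in document order, so attributes come first. -->
  <xsl:template match="p">
    <p>
      <xsl:apply-templates select="@* | node()[not(self::p)]"/>
    </p>
    <!-- Nested p children are emitted after the closing tag, as siblings. -->
    <xsl:apply-templates select="p"/>
  </xsl:template>

</xsl:stylesheet>
```

This version needs no extra mode, at the cost of a slightly denser select expression.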

This assumes that some elements other than p may occur within the body elements of your input; if nothing but p ever occurs, the solution offered by Nils Werner will do fine.

In real life, however, if I had to handle input like this I'd probably run Tidy over it instead of rolling my own XSLT stylesheet to do a small part of what Tidy does.

Upvotes: 2

Nils Werner

Reputation: 36775

You can try the following:

<?xml version='1.0'?>
<xsl:stylesheet
    version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:output method="xml" 
    indent="yes" />

<xsl:template match="/">
    <xsl:apply-templates select="html/body/*" mode="fixnested" />
</xsl:template>

<xsl:template match="*" mode="fixnested">
    <xsl:element name="{name()}">
        <xsl:apply-templates select="@* | text()" mode="fixnested" />
    </xsl:element>
    <xsl:apply-templates select="*" mode="fixnested" />
</xsl:template>

<xsl:template match="@*" mode="fixnested">
    <xsl:attribute name="{name(.)}">
        <xsl:value-of select="."/>
    </xsl:attribute>
</xsl:template>

</xsl:stylesheet>

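As you can see, I've kept it fairly generic, so you can feed it any XML (not just nested <p> elements) and have it flattened. These templates preserve attributes and text content. Note that the root template selects only html/body/*, so the html and body wrapper elements themselves are not copied to the output.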

Upvotes: 0
