AsTeR

Extract text through XSL by skipping content within given children

I'm trying to extract the text of an interesting node (here big-structured-text) but within this node there are some children I would like to skip (here title, subtitle, and code). Those "to remove" nodes can have children.

Sample data:

<root>
    <big-structured-text>
        <section>
            <title>Introduction</title>
            In this part we describe Australian foreign policy....
            <subsection>
                <subtitle>Historical context</subtitle>
                After its independence...
                <meta>
                    <keyword>foreign policy</keyword>
                    <keyword>australia</keyword>
                    <code>
                        <value>XXHY-123</value>
                        <label>IRRN</label>
                    </code>
                </meta>
            </subsection>
        </section>
    </big-structured-text>
    <!-- ... -->
    <big-structured-text>
        <!-- ... -->
    </big-structured-text>
</root>

So far I've tried:

<xsl:for-each select="//big-structured-text">
    <text>
        <xsl:value-of select=".//*[not(*)
            and not(ancestor-or-self::code)
            and not(ancestor-or-self::subtitle)
            and not(ancestor-or-self::title)]" />
    </text>
</xsl:for-each>

But this only selects elements that have no child elements, so it picks up keyword but not the text following the Introduction title.

I've also tried:

<xsl:for-each select="//big-structured-text">
    <text>
        <xsl:value-of select=".//*[not(ancestor-or-self::code)
            and not(ancestor-or-self::subtitle)
            and not(ancestor-or-self::title)]" />
    </text>
</xsl:for-each>

But this outputs the interesting text multiple times, and sometimes the uninteresting text as well (every node is output once for itself and then once per ancestor).

Answers (1)

Ian Roberts

Rather than for-each you could approach this using templates. The default behaviour when you apply-templates to an element node is simply to recursively apply them to all its child nodes (which includes text nodes as well as other elements), and for a text node to output the text. Therefore all you need to do is create empty templates to squash the elements you don't want and then let the default templates do the rest.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="/">
    <root>
      <xsl:apply-templates select="/root/big-structured-text" />
    </root>
  </xsl:template>

  <xsl:template match="big-structured-text">
    <text><xsl:apply-templates /></text>
  </xsl:template>

  <!-- empty template means anything inside any of these elements will be
       ignored -->
  <xsl:template match="title | subtitle | code" />
</xsl:stylesheet>

When run on your sample input this produces

<?xml version="1.0"?>
<root><text>


            In this part we describe Australian foreign policy....


                After its independence...

                    foreign policy
                    australia




    </text><text>

    </text></root>
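
For reference, the default behaviour this relies on corresponds to the built-in template rules of XSLT 1.0. A processor behaves roughly as if the following had been written out, so this is only a sketch to show why the single empty template is enough; you do not need to add these rules yourself.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- built-in rule: for the root node and any element, just recurse
       into the child nodes -->
  <xsl:template match="*|/">
    <xsl:apply-templates />
  </xsl:template>

  <!-- built-in rule: text nodes (and attributes, when selected) are
       copied to the output as text -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="." />
  </xsl:template>

  <!-- built-in rule: comments and processing instructions produce
       nothing -->
  <xsl:template match="processing-instruction()|comment()" />
</xsl:stylesheet>

Any template you declare explicitly, such as the empty one for title, subtitle and code, takes precedence over these built-in rules, which is why the content of those elements is never reached.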

You may wish to investigate the use of <xsl:strip-space> to get rid of some of the extraneous whitespace, but with mixed content you always have to be careful not to strip out too much.
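
As a minimal sketch of that idea, the stylesheet could be extended as below. It assumes that it is acceptable to collapse runs of whitespace inside each text node and to join the remaining fragments with a single space; both are my assumptions, not requirements from the question, and they may be wrong for your data if the text contains inline markup.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- drop whitespace-only text nodes from the input tree -->
  <xsl:strip-space elements="*" />

  <xsl:template match="/">
    <root>
      <xsl:apply-templates select="/root/big-structured-text" />
    </root>
  </xsl:template>

  <xsl:template match="big-structured-text">
    <text><xsl:apply-templates /></text>
  </xsl:template>

  <!-- same empty template as before: skip these elements entirely -->
  <xsl:template match="title | subtitle | code" />

  <!-- assumption: trim and collapse whitespace in every remaining text
       node, then add one space as a separator -->
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)" />
    <xsl:text> </xsl:text>
  </xsl:template>
</xsl:stylesheet>

Note that this leaves a trailing space at the end of each text element, and that normalising each text node separately can change the spacing around inline elements, so check the result against real mixed content before relying on it.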
