Matthew Wilkens

Reputation: 103

Extract part of an XML file as plain text using XSLT

Seems like this should be easy, but ...

I'm trying to use XSLT to extract part of an XML file as plain text, throwing away the rest.

So from sample input like this ...

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://segonku.unl.edu/teianalytics/TEIAnalytics.rng"
                        type="xml"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
   <teiHeader type="text">
      <fileDesc>
         <titleStmt>
            <title>Header Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Published</p>
         </publicationStmt>
         <sourceDesc>
            <p>Sourced</p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <front>
      </front>
      <body>
         <head>THE TITLE</head>
         <div type="chapter" part="N" org="uniform" sample="complete">
            <head>CHAPTER I</head>
            <p>Some text.</p>
         </div>
      </body>
   </text>
</TEI>

... I'm trying to get just the text contained within the <body> tags and all their children. The desired output in this case is:

THE TITLE
CHAPTER I
Some text.

Potential complication: <body> can also exist in the <front> matter and/or in the <teiHeader>, so what I really need is the children of <body> if and only if that tag is a child of <text> and of <TEI>.

I've tried really simple XSL like this ...

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>
</xsl:stylesheet>

... but it gives me plain text of everything in the file, not just the <body> elements.

Thanks!

Upvotes: 10

Views: 8983

Answers (3)

Dimitre Novatchev

Reputation: 243479

I've tried really simple XSL like this ...

...

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>
</xsl:stylesheet>

... but it gives me plain text of everything in the file, not just the <body> elements.

The reason for this is a well-known feature of XPath (and the cause of many thousands of similar questions): any unprefixed name is treated as belonging to "no namespace". However, every element in the provided XML document belongs to the namespace "http://www.tei-c.org/ns/1.0" and must be addressed as a node in that namespace. Since the template's pattern therefore matches nothing, the XSLT built-in template rules process the whole document and output every text node.
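To make the "everything is printed" behaviour concrete, here is a sketch of the original stylesheet with the XSLT 1.0 built-in template rules written out explicitly. The two extra templates are not in the question's code; they only spell out what the processor already does implicitly, and the comments are my reading of the situation rather than anything the asker wrote.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>

    <!-- Never fires for the sample document: TEI, text and body are looked up
         in no namespace, but the real elements are in
         http://www.tei-c.org/ns/1.0.
         (If it did fire, select="." would re-select the same body node and
         the template would most likely recurse without ever terminating.) -->
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>

    <!-- Built-in rule for the root and for elements, written out explicitly:
         descend into the children. -->
    <xsl:template match="*|/">
        <xsl:apply-templates/>
    </xsl:template>

    <!-- Built-in rule for text and attribute nodes, written out explicitly:
         copy the string value to the output. -->
    <xsl:template match="text()|@*">
        <xsl:value-of select="."/>
    </xsl:template>
</xsl:stylesheet>
```

With only these rules in effect, every text node in the file reaches the output, header included, which matches what the question reports.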

Solution: Declare the document's default namespace in the XSLT code (this time with a prefix bound to it) and use that prefix on every element name.

This is one of the simplest and shortest possible transformations that produces the wanted result:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="x:text/x:body//text()">
  <xsl:value-of select="concat(.,'&#xA;')"/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When applied on the provided XML document:

<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
    <teiHeader type="text">
        <fileDesc>
            <titleStmt>
                <title>Header Title</title>
            </titleStmt>
            <publicationStmt>
                <p>Published</p>
            </publicationStmt>
            <sourceDesc>
                <p>Sourced</p>
            </sourceDesc>
        </fileDesc>
    </teiHeader>
    <text>
        <front>      </front>
        <body>
            <head>THE TITLE</head>
            <div type="chapter" part="N" org="uniform" sample="complete">
                <head>CHAPTER I</head>
                <p>Some text.</p>
            </div>
        </body>
    </text>
</TEI>

the wanted, correct result is produced:

THE TITLE
CHAPTER I
Some text.

Upvotes: 9

Grzegorz Szpetkowski

Reputation: 37934

You can use:

<xsl:strip-space elements="*"/>

and

<xsl:template match="/" xmlns:n="http://www.tei-c.org/ns/1.0">
    <xsl:for-each select="/n:TEI/n:text/n:body/descendant::*/text()">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()">
            <xsl:text>&#xa;</xsl:text>
        </xsl:if>
    </xsl:for-each>
</xsl:template>

It returns:

THE TITLE
CHAPTER I
Some text.
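For anyone pasting this in, the two fragments could be assembled into a complete stylesheet roughly as follows. This is a sketch: the `xsl:output method="text"` line is carried over from the question rather than from the fragments above, and the namespace declaration is moved to the root element, which is equivalent to declaring it on the template.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:n="http://www.tei-c.org/ns/1.0">
    <!-- Assumed from the question: plain-text output -->
    <xsl:output method="text"/>
    <!-- Drop whitespace-only text nodes (indentation between elements) -->
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
        <!-- Text nodes whose parent is an element somewhere below the main body -->
        <xsl:for-each select="/n:TEI/n:text/n:body/descendant::*/text()">
            <xsl:value-of select="."/>
            <!-- Newline between items, but not after the last one -->
            <xsl:if test="position() != last()">
                <xsl:text>&#xa;</xsl:text>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Note that `descendant::*/text()` only selects text nodes that sit inside a descendant element of `body`; text placed directly inside `body` itself would be skipped. In the sample that is just indentation, which `xsl:strip-space` removes anyway.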

Upvotes: 2

k_b

Reputation: 2480

Try matching /TEI/text/body//text()
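With the sample document, that pattern would presumably still need namespace prefixes, since all the elements are in the TEI namespace. A minimal sketch of a full stylesheet along these lines, where the prefix `t` is an arbitrary choice and the trailing newline per text node is an assumption about the desired layout:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:t="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>
    <!-- Drop whitespace-only text nodes so indentation is not printed -->
    <xsl:strip-space elements="*"/>

    <!-- Only text below a body that is a child of text, itself a child of
         the root TEI element; a body in front matter or the header is ignored -->
    <xsl:template match="/t:TEI/t:text/t:body//text()">
        <xsl:value-of select="."/>
        <xsl:text>&#xA;</xsl:text>
    </xsl:template>

    <!-- Suppress every other text node -->
    <xsl:template match="text()"/>
</xsl:stylesheet>
```

Anchoring the pattern at `/t:TEI` is what addresses the "if and only if" condition in the question; the empty `text()` template has lower default priority, so it only catches the text nodes the first template does not match.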

Upvotes: 0
