zachberry

Reputation: 97

Generating a flat list of element text positions from a nested XML structure with XSLT

I'd like to take an XML document such as

<alpha>
  <beta>
    Here is <x>some text <y>and more</y> text</x> with <x>a little more text</x>!
  </beta>
</alpha>

and use XSLT to transform it into something like

x:8,31
x:37,55
y:18,26

The exact formatting isn't critical. What I'm trying to figure out is how to get the text positions of the various <x> and <y> elements, which can appear multiple times and can be nested (in the example there are two <x> elements, and a <y> element inside an <x> element). The desired output above says that one x element starts at text position 8 and ends at text position 31 inside the <beta> element, another x element runs from 37 to 55, and a y element runs from 18 to 26. The order of the output list is not important.

I've seen mentions of substring-before and count, but I can't figure out how they would work with an arbitrary depth of nesting, or when the same element occurs several times at different points in the text.

Is something like this possible with just XSLT?

Upvotes: 1

Views: 143

Answers (2)

Daniel Haley

Reputation: 52878

Here's an XSLT 1.0 option that is very similar to Martin's...

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="text()"/>

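  <xsl:template match="x|y">
    <!-- id of the nearest ancestor that is neither x nor y (beta in the sample) -->
    <xsl:variable name="ancestor"
      select="generate-id(ancestor::*[not(self::x) and not(self::y)][1])"/>
    <!-- concatenate all text that precedes this element inside that ancestor -->
    <xsl:variable name="preceding">
      <xsl:for-each select="preceding::text()[ancestor::*[generate-id()=$ancestor]]">
        <xsl:value-of select="."/>
      </xsl:for-each>
    </xsl:variable>
    <!-- start = length of the preceding text; end = start + length of this element's text -->
    <xsl:value-of select="concat(name(),':',string-length($preceding),',',
      string-length($preceding) + string-length(),'&#xA;')"/>
    <xsl:apply-templates/>
  </xsl:template>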

</xsl:stylesheet>

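As Martin mentioned, the whitespace in beta is significant because beta has mixed content (both text and element children). xsl:strip-space only removes whitespace-only text nodes, so the leading line break and indentation stay in the first text node and are counted.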

If you remove the leading/trailing spaces...

<alpha>
    <beta>Here is <x>some text <y>and more</y> text</x> with <x>a little more text</x>!</beta>
</alpha>

the output is as requested...

x:8,31
y:18,26
x:37,55
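
If editing the input isn't an option, one possible variation is to measure the leading whitespace of the container and subtract it from each start position. This is only a sketch of that idea, not something from the answer above: the variable names (container, container-id, container-text, lead, start) are made up for illustration, and it assumes the leading whitespace sits before the first x or y element.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="text()"/>

  <xsl:template match="x|y">
    <!-- nearest ancestor that is neither x nor y (beta in the sample) -->
    <xsl:variable name="container"
      select="ancestor::*[not(self::x) and not(self::y)][1]"/>
    <xsl:variable name="container-id" select="generate-id($container)"/>
    <xsl:variable name="container-text" select="string($container)"/>
    <!-- length of everything before the first non-whitespace character -->
    <xsl:variable name="lead"
      select="string-length(substring-before($container-text,
              substring(normalize-space($container-text), 1, 1)))"/>
    <!-- all text that precedes this element inside the container -->
    <xsl:variable name="preceding">
      <xsl:for-each select="preceding::text()[ancestor::*[generate-id()=$container-id]]">
        <xsl:value-of select="."/>
      </xsl:for-each>
    </xsl:variable>
    <!-- start position with the leading whitespace discounted -->
    <xsl:variable name="start" select="string-length($preceding) - $lead"/>
    <xsl:value-of select="concat(name(),':',$start,',',
      $start + string-length(),'&#xA;')"/>
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

With the original, indented input this should give the same three lines as above. The normalize-space/substring-before combination stands in for a "trim left" function, which XSLT 1.0 does not have.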

Upvotes: 3

Martin Honnen

Reputation: 167696

Using

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="text"/>

    <xsl:template match="node()">
        <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="beta//*">
        <xsl:variable name="preceding-length" select="sum((preceding::text() intersect ancestor::beta//text())/string-length())"/>
        <xsl:value-of select="local-name(), ': ', $preceding-length, ', ', $preceding-length + string-length()"/>
        <xsl:text>&#10;</xsl:text>
        <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="text()"/>

</xsl:stylesheet>

and an XSLT 2.0 processor, your sample gives me

x :  11 ,  34
y :  21 ,  29
x :  40 ,  58

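The offset of 3 from your desired results is most likely white space: the line break and indentation before "Here" inside beta are part of the text and get counted.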
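
If that leading white space should not count, one way to discount it in XSLT 2.0 is to compare the length of beta's string value with and without its leading white space and subtract the difference. This is a sketch under assumptions rather than a tested replacement for the stylesheet above: the variable names (container, lead, start) are illustrative, the output is built with concat() to avoid the separator spaces, and it assumes no x or y element starts inside the leading white space.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:output method="text"/>

    <!-- suppress text; elements outside beta fall through to the built-in rules -->
    <xsl:template match="text()"/>

    <xsl:template match="beta//*">
        <xsl:variable name="container" select="ancestor::beta[1]"/>
        <!-- number of leading white-space characters in beta's string value -->
        <xsl:variable name="lead"
            select="string-length($container) - string-length(replace($container, '^\s+', ''))"/>
        <!-- text inside beta that precedes this element, minus the leading white space -->
        <xsl:variable name="start"
            select="sum((preceding::text() intersect $container//text())/string-length()) - $lead"/>
        <xsl:value-of select="concat(local-name(), ':', $start, ',', $start + string-length(), '&#10;')"/>
        <xsl:apply-templates/>
    </xsl:template>

</xsl:stylesheet>
```

For the sample input this should print x:8,31, y:18,26 and x:37,55, one per line, whatever the indentation before "Here" is.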

Upvotes: 2
