zsquare

Reputation: 10146

Extract text based on previous and next sibling

I'm trying to extract data from the following structure:

<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br />
data#1
<br />
<br />
<span>Heading4</span><br />
&acirc;&euro;&cent; data#4.1
<br />
&acirc;&euro;&cent; data#4.2
<br />
&acirc;&euro;&cent; data#4.3
<br />
&acirc;&euro;&cent; data#4.4
<br />
<br />
<span>Heading5</span>
<br />
&acirc;&euro;&cent; data#5.1
<br />
&acirc;&euro;&cent; data#5.2
<br />
&acirc;&euro;&cent; data#5.3
<br />
<br />

I can extract data#1 using something like this:

span[text()='Heading1']/following-sibling::br[1]/following::text()[1]

But I can't figure out how to extract the data under Heading4. I need to extract data#4.1, data#4.2, data#4.3 and data#4.4. The number of points is not fixed and can vary.

Upvotes: 2

Views: 5778

Answers (4)

Dimitre Novatchev

Reputation: 243529

This XPath 1.0 expression selects exactly the wanted nodes:

  /*/span[.='Heading4']
        /following-sibling::text()
           [count(.|/*/span[.='Heading5']/preceding-sibling::text())
           =
            count(/*/span[.='Heading5']/preceding-sibling::text())
            ]
                  [normalize-space()]

It is derived from the well-known Kayessian method for the intersection of two node-sets $ns1 and $ns2:

$ns1[count(.|$ns2) = count($ns2)]

We obtain the first expression above if in the Kayessian formula we substitute $ns1 with:

  /*/span[.='Heading4']/following-sibling::text()

and $ns2 with:

  /*/span[.='Heading5']/preceding-sibling::text()

The final predicate [normalize-space()] filters out the whitespace-only text nodes from this intersection.
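To make the intersection step concrete, here is a minimal sketch in Python using only the standard library. It is not part of the answer above. `xml.etree.ElementTree` has no `text()` node test or sibling axes, so the sketch identifies each text node by the index of the child element it follows (that element's `tail`). The shortened sample document and the helper names `span_index` and `kayessian` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's markup (bullet entities left out).
DOC = """<html>
  <span>Heading1</span><br/> data#1 <br/><br/>
  <span>Heading4</span><br/> data#4.1 <br/> data#4.2 <br/><br/>
  <span>Heading5</span><br/> data#5.1 <br/><br/>
</html>"""

children = list(ET.fromstring(DOC))

def span_index(label):
    """Index of the <span> child whose text equals label."""
    return next(i for i, el in enumerate(children)
                if el.tag == "span" and el.text == label)

start, end = span_index("Heading4"), span_index("Heading5")

# A text node is represented by the index of the element it follows.
ns1 = set(range(start, len(children)))  # following-sibling::text() of Heading4
ns2 = set(range(0, end))                # preceding-sibling::text() of Heading5
# (the text before the first child is also in ns2, but it can never be in ns1)

def kayessian(ns1, ns2):
    """$ns1[count(.|$ns2) = count($ns2)]: keep n if adding it to ns2 changes nothing."""
    return {n for n in ns1 if len({n} | ns2) == len(ns2)}

selected = sorted(kayessian(ns1, ns2))

# Equivalent of the trailing [normalize-space()] predicate.
result = [children[i].tail.strip() for i in selected
          if (children[i].tail or "").strip()]
print(result)
```

A node survives only when taking its union with $ns2 leaves the size of $ns2 unchanged, which is exactly the condition for the node to belong to both sets.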

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:template match="/">
     <xsl:copy-of select=
      "/*/span[.='Heading4']
            /following-sibling::text()
               [count(.|/*/span[.='Heading5']/preceding-sibling::text())
               =
                count(/*/span[.='Heading5']/preceding-sibling::text())
                ]
                [normalize-space()]
      "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document (with the entities replaced, since we don't have a DTD that defines them and they aren't essential here):

<html>
    <span>Heading</span>
    <br />
    <br />
    <span>Heading1</span>
    <br /> data#1 
    <br />
    <br />
    <span>Heading4</span>
    <br /> #acirc;#euro;#cent; data#4.1 
    <br /> #acirc;#euro;#cent; data#4.2 
    <br /> #acirc;#euro;#cent; data#4.3 
    <br /> #acirc;#euro;#cent; data#4.4 
    <br />
    <br />
    <span>Heading5</span>
    <br /> #acirc;#euro;#cent; data#5.1 
    <br /> #acirc;#euro;#cent; data#5.2 
    <br /> #acirc;#euro;#cent; data#5.3 
    <br />
    <br />
</html>

the XPath expression is evaluated and the result of this evaluation is copied to the output:

 #acirc;#euro;#cent; data#4.1 
     #acirc;#euro;#cent; data#4.2 
     #acirc;#euro;#cent; data#4.3 
     #acirc;#euro;#cent; data#4.4 

Upvotes: 3

zsquare

Reputation: 10146

I finally ended up using this, with help from the answer here:

//text()[preceding-sibling::span[1] = 'Heading4']
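For anyone doing this outside an XPath engine, the same idea ("the nearest preceding span is Heading4") can be sketched in plain Python. This is only an illustration with the standard library: the shortened sample document and the function name `text_under` are my assumptions, and unlike the XPath above it drops whitespace-only text nodes.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's markup (bullet entities left out).
DOC = """<html>
  <span>Heading1</span><br/> data#1 <br/><br/>
  <span>Heading4</span><br/> data#4.1 <br/> data#4.2 <br/><br/>
  <span>Heading5</span><br/> data#5.1 <br/><br/>
</html>"""

def text_under(root, heading):
    """Collect non-blank text whose nearest preceding <span> sibling is `heading`."""
    current = None  # text of the most recent <span> seen so far
    found = []
    for el in root:
        if el.tag == "span":
            current = el.text
        tail = (el.tail or "").strip()  # text node that follows this element
        if current == heading and tail:
            found.append(tail)
    return found

root = ET.fromstring(DOC)
print(text_under(root, "Heading4"))
```

Because the current heading changes as soon as the next span is reached, the number of data lines under each heading does not need to be known in advance.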

Upvotes: 1

BeniBela

Reputation: 16917

You can use

span[text()='Heading4']/following-sibling::text()[. != ""] 

to get all the text after Heading4, and then use

span[text()='Heading5']/following-sibling::text()[. != ""]

to get the text after Heading5 that you don't want, and then subtract the second result set from the first in your main program.
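A rough sketch of that subtraction step, assuming the main program is Python and only the standard library is available. `xml.etree.ElementTree` cannot evaluate `following-sibling::text()`, so the hypothetical helper `texts_after` emulates it by reading the `tail` of each child element. Nodes are compared by position rather than by string value, so two data lines with identical text are not confused. The shortened sample document is also an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's markup (bullet entities left out).
DOC = """<html>
  <span>Heading1</span><br/> data#1 <br/><br/>
  <span>Heading4</span><br/> data#4.1 <br/> data#4.2 <br/><br/>
  <span>Heading5</span><br/> data#5.1 <br/><br/>
</html>"""

children = list(ET.fromstring(DOC))

def texts_after(label):
    """(position, text) pairs for non-blank text after the span named `label`."""
    start = next(i for i, el in enumerate(children)
                 if el.tag == "span" and el.text == label)
    return [(i, el.tail.strip()) for i, el in enumerate(children)
            if i >= start and el.tail and el.tail.strip()]

after_h4 = texts_after("Heading4")       # includes the Heading5 data as well
after_h5 = set(texts_after("Heading5"))  # the part that is not wanted

# Subtract the second result from the first, keeping document order.
wanted = [text for pos, text in after_h4 if (pos, text) not in after_h5]
print(wanted)
```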

And if you have XPath 2, you can exclude them directly with the except operator:

span[text()='Heading4']/following-sibling::text()[. != ""] except span[text()='Heading5']/following::text()[. != ""]

You can strip the leading &acirc;&euro;&cent; and keep only the data with the substring(., 5) function, so the final XPath 2 expression becomes:

(span[text()='Heading4']/following-sibling::text()[. != ""] except span[text()='Heading5']/following::text()[. != ""])/substring(., 5)
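If the prefix is removed in the host program instead, note that XPath strings are 1-indexed, so substring(., 5) corresponds to the slice `[4:]` in Python. The fixed offset only works when the text node starts directly with the three characters plus a space; leading whitespace or a newline would shift it. A slightly more tolerant sketch is shown below, where the name `strip_bullet` is hypothetical and the three code points are what the entities decode to:

```python
# The characters produced by &acirc; &euro; &cent;
BULLET = "\u00e2\u20ac\u00a2"

def strip_bullet(text):
    """Trim surrounding whitespace, then drop the bullet prefix if present."""
    text = text.strip()
    if text.startswith(BULLET):
        text = text[len(BULLET):].lstrip()
    return text

raw = "\n" + BULLET + " data#4.1\n"   # text node with surrounding newlines
print(strip_bullet(raw))

# Fixed-offset variant, the direct counterpart of substring(., 5)
print((BULLET + " data#4.1")[4:])
```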

And since you haven't stated a language requirement, you might also want to look at my Pascal-based query language, because it is, imho, much nicer:

 <span>Heading4</span><br />
 <t:loop>
    {filter(text(), "data.*")}<br/>
 </t:loop>
 <br/>
 <span>Heading5</span><br />

Upvotes: 2

Mike

Reputation: 2153

I'd use

span[text()='Heading4']/following-sibling::text()

and then parse the resulting text separately.

Upvotes: 0
