Reputation: 10146
I'm trying to extract data from the following structure:
<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br />
data#1
<br />
<br />
<span>Heading4</span><br />
• data#4.1
<br />
• data#4.2
<br />
• data#4.3
<br />
• data#4.4
<br />
<br />
<span>Heading5</span>
<br />
• data#5.1
<br />
• data#5.2
<br />
• data#5.3
<br />
<br />
I can extract data#1 using something like this:
span[text()='Heading1']/following-sibling::br[1]/following::text()[1]
But I can't figure out how to extract the data under Heading4. I need to extract data#4.1, data#4.2, data#4.3 and data#4.4.
The number of points is not fixed and can vary.
Upvotes: 2
Views: 5778
Reputation: 243529
This XPath 1.0 expression selects exactly the wanted nodes:
/*/span[.='Heading4']
/following-sibling::text()
[count(.|/*/span[.='Heading5']/preceding-sibling::text())
=
count(/*/span[.='Heading5']/preceding-sibling::text())
]
[normalize-space()]
It is produced from the well-known Kayessian method for the intersection of two node-sets $ns1 and $ns2:
$ns1[count(.|$ns2) = count($ns2)]
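To see why the Kayessian formula yields an intersection, here is a minimal Python sketch of the same counting trick. It is only an illustration: the function name and the sample lists are hypothetical, and it compares plain strings by value, whereas XPath node-sets compare nodes by identity.

```python
def kayessian_intersection(ns1, ns2):
    """Return the members of ns1 that also belong to ns2, in ns1 order.

    Mirrors $ns1[count(.|$ns2) = count($ns2)]: adding a node to ns2 leaves
    the size of the union unchanged only if the node is already in ns2.
    """
    ns2_set = set(ns2)
    return [node for node in ns1 if len({node} | ns2_set) == len(ns2_set)]


# Hypothetical stand-ins for the text nodes of the sample document.
after_heading4 = ["4.1", "4.2", "4.3", "4.4", "5.1", "5.2", "5.3"]
before_heading5 = ["1", "4.1", "4.2", "4.3", "4.4"]

wanted = kayessian_intersection(after_heading4, before_heading5)
print(wanted)
```

Only the items that appear both after Heading4 and before Heading5 survive the filter.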
We obtain the first expression above if, in the Kayessian formula, we substitute $ns1 with:
/*/span[.='Heading4']/following-sibling::text()
and $ns2 with:
/*/span[.='Heading5']/preceding-sibling::text()
The final predicate [normalize-space()] filters out the whitespace-only text nodes from this intersection.
XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/span[.='Heading4']
/following-sibling::text()
[count(.|/*/span[.='Heading5']/preceding-sibling::text())
=
count(/*/span[.='Heading5']/preceding-sibling::text())
]
[normalize-space()]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document (with the entities replaced, since no DTD defining them is available and this isn't essential here):
<html>
<span>Heading</span>
<br />
<br />
<span>Heading1</span>
<br /> data#1
<br />
<br />
<span>Heading4</span>
<br /> #acirc;#euro;#cent; data#4.1
<br /> #acirc;#euro;#cent; data#4.2
<br /> #acirc;#euro;#cent; data#4.3
<br /> #acirc;#euro;#cent; data#4.4
<br />
<br />
<span>Heading5</span>
<br /> #acirc;#euro;#cent; data#5.1
<br /> #acirc;#euro;#cent; data#5.2
<br /> #acirc;#euro;#cent; data#5.3
<br />
<br />
</html>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
#acirc;#euro;#cent; data#4.1
#acirc;#euro;#cent; data#4.2
#acirc;#euro;#cent; data#4.3
#acirc;#euro;#cent; data#4.4
Upvotes: 3
Reputation: 10146
I finally ended up using this, with help from the answer here:
//text()[preceding-sibling::span[1] = 'Heading4']
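The idea behind this expression (keep every text node whose nearest preceding span is Heading4) can be sketched in Python with the standard library. This is an assumption-laden illustration rather than an XPath evaluation: xml.etree.ElementTree has no text() node test, so the sketch reads each element's tail text instead. The DOC sample, the "*" bullet placeholder and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the question; "*" stands in for the bullet.
DOC = (
    "<html>"
    "<span>Heading1</span><br/> data#1<br/><br/>"
    "<span>Heading4</span><br/> * data#4.1<br/> * data#4.2"
    "<br/> * data#4.3<br/> * data#4.4<br/><br/>"
    "<span>Heading5</span><br/> * data#5.1<br/> * data#5.2<br/><br/>"
    "</html>"
)


def text_under(root, heading):
    """Collect non-blank text whose nearest preceding <span> equals heading."""
    current = None  # text of the most recent <span> seen so far
    found = []
    for child in root:
        if child.tag == "span":
            current = child.text
        # child.tail is the text that directly follows this element.
        if current == heading and child.tail and child.tail.strip():
            found.append(child.tail.strip())
    return found


items = text_under(ET.fromstring(DOC), "Heading4")
print(items)
```

Unlike the XPath expression, this sketch also drops whitespace-only text, which the expression would need an extra [normalize-space()] predicate to do.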
Upvotes: 1
Reputation: 16917
You can use
span[text()='Heading4']/following-sibling::text()[. != ""]
to get all the text after Heading4, and then use
span[text()='Heading5']/following-sibling::text()[. != ""]
to get the text after Heading5 that you don't want, and then subtract the second result set from the first in your main program.
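A rough sketch of that subtraction step in Python, using only the standard library. Because xml.etree.ElementTree cannot evaluate text() directly, the helper below imitates the two queries by walking the sibling elements. The DOC sample, the "*" bullet placeholder and the helper name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the question; "*" stands in for the bullet.
DOC = (
    "<html>"
    "<span>Heading1</span><br/> data#1<br/><br/>"
    "<span>Heading4</span><br/> * data#4.1<br/> * data#4.2"
    "<br/> * data#4.3<br/> * data#4.4<br/><br/>"
    "<span>Heading5</span><br/> * data#5.1<br/> * data#5.2<br/><br/>"
    "</html>"
)


def texts_after(root, heading):
    """Return (position, text) pairs for non-blank text after the heading span."""
    seen = False
    pairs = []
    for position, child in enumerate(root):
        if child.tag == "span" and child.text == heading:
            seen = True
        if seen and child.tail and child.tail.strip():
            # The position keeps equal strings at different places distinct.
            pairs.append((position, child.tail.strip()))
    return pairs


root = ET.fromstring(DOC)
after_h4 = texts_after(root, "Heading4")
after_h5 = set(texts_after(root, "Heading5"))

# Subtract the second result set from the first.
wanted = [text for (position, text) in after_h4 if (position, text) not in after_h5]
print(wanted)
```

Pairing each string with its position is a design choice so that the subtraction works on node identity rather than on text value.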
And if you have XPath 2, you can exclude them directly with the except operator:
span[text()='Heading4']/following-sibling::text()[. != ""] except span[text()='Heading5']/following::text()[. != ""]
You can get only the data, without the leading •, using the substring(., 5) function, so the final XPath 2 expression becomes:
(span[text()='Heading4']/following-sibling::text()[. != ""] except span[text()='Heading5']/following::text()[. != ""])/substring(., 5)
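For readers post-processing the strings outside XPath, here is a small Python sketch of what substring(., 5) does. XPath string positions are 1-based, so position 5 corresponds to Python index 4. The helper name is hypothetical, and the sample item assumes the three-character bullet sequence is followed by exactly one space.

```python
def xpath_substring_from(value, start):
    """Emulate XPath substring(value, start) for an integer start >= 1."""
    return value[start - 1:]


# The three characters shown in the question, written as escapes.
bullet = "\u00e2\u20ac\u00a2"
item = bullet + " data#4.1"

stripped = xpath_substring_from(item, 5)
print(stripped)
```

If a text node starts with extra whitespace, the offset shifts accordingly, so trimming the string first is the safer order.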
And since you haven't explicitly stated a language requirement, you might also want to look at my Pascal-based query language, because it is, IMHO, much nicer:
<span>Heading4</span><br />
<t:loop>
{filter(text(), "data.*")}<br/>
</t:loop>
<br/>
<span>Heading5</span><br />
Upvotes: 2
Reputation: 2153
I'd use
span[text()='Heading4']/following-sibling::text()
and then parse resulting text separately.
Upvotes: 0