Reputation: 17
I'm very new to using XPath. I'm trying to extract some information from a Law & Regulation website. Right now I just want to extract the text starting at "Article 1." and ending at the "PRIME MINISTER" that is inside a <b> tag.
<p>
<b> <span> Article 1. </span> </b>
<span>
To approve the master plan on development
of tourism in Northern Central Vietnam
with the following principal contents:
</span>
</p>
<p>
<span>
1. Development viewpoints
</span>
</p>
<p>
<span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
</span>
</p>
<p>
<span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>
<p>
<span>
<b> PRIME MINISTER </b>
</span>
</p>
<p>
<b> <span> Article 2. </span> </b>
<span>
.................
</span>
</p>
<p>
<span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>
As the expected output, I should get a list similar to:
[
'Article 1.' ,
'To approve the master plan on development of tourism in Northern
Central Vietnam with the following principal contents: ',
'1. Development viewpoints' ,
'To realize general viewpoints of the strategy for and master plan on
development of Vietnam’s tourism through 2020.' ,
'PRIME MINISTER: Nguyen Tan Dung',
'PRIME MINISTER'
]
The first item in the list is "Article 1." and the last item is the "PRIME MINISTER" that is inside a <b> tag.
Upvotes: 1
Views: 186
Reputation: 243459
A single, plain XPath 1.0 expression:
/*/p[starts-with(normalize-space(), 'Article 1.')][1]
|
/*/p[starts-with(normalize-space(), 'Article 1.')][1]
  /following-sibling::p
    [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
     and
     following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
     and
     not(starts-with(normalize-space(), 'PRIME MINISTER'))
    ]
When evaluated against this XML document:
<html>
<p>
<b> <span> Article 1. </span> </b>
<span>
To approve the master plan on development
of tourism in Northern Central Vietnam
with the following principal contents:
</span>
</p>
<p>
<span>
1. Development viewpoints
</span>
</p>
<p>
<span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
</span>
</p>
<p>
<span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>
<p>
<span>
<b> PRIME MINISTER </b>
</span>
</p>
<p>
<b> <span> Article 2. </span> </b>
<span>
.................
</span>
</p>
<p>
<span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>
</html>
it selects the <p> elements from "Article 1." up to, but not including, the first paragraph that starts with "PRIME MINISTER".
Verification:
This XSLT transformation evaluates the XPath expression and outputs all nodes selected in this evaluation:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:template match="/">
    <xsl:copy-of select=
      "/*/p[starts-with(normalize-space(), 'Article 1.')][1]
       |
       /*/p[starts-with(normalize-space(), 'Article 1.')][1]
         /following-sibling::p
           [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
            and
            following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
            and
            not(starts-with(normalize-space(), 'PRIME MINISTER'))
           ]
      "/>
  </xsl:template>
</xsl:stylesheet>
When applied against the same XML document (above), the wanted result is produced:
<p>
<b>
<span> Article 1. </span>
</b>
<span>
To approve the master plan on development
of tourism in Northern Central Vietnam
with the following principal contents:
</span>
</p>
<p>
<span>
1. Development viewpoints
</span>
</p>
<p>
<span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
</span>
</p>
and it is displayed by the browser as intended:
Article 1. To approve the master plan on development of tourism in Northern Central Vietnam with the following principal contents:
1. Development viewpoints
To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
Upvotes: 0
Reputation: 163262
"Until" and "Between" queries are surprisingly difficult in XPath, even with later XPath versions than 1.0.
If we work back from later versions, in XPath 3.1 you can do something like this:
let $first := p[contains(., 'Article 1')],
    $last := p[contains(., 'PRIME MINISTER')]
return ($first, p[. >> $first and . << $last], $last)
In XPath 2.0 we don't have let, but for works just as well; it just reads a bit oddly.
But in 1.0 (a) we can't bind variables, and (b) we don't have the << and >> operators, which makes it much more difficult.
The simplest expression is probably
p[(.|preceding-sibling::p)[contains(., 'Article 1')] and
(.|following-sibling::p)[contains(., 'PRIME MINISTER')]]
Unfortunately, without an incredibly smart optimizer, that's likely to be horrendously inefficient with a large input document (both the contains() tests will be executed around (N^2)/2 times where N is the number of paragraphs). If you're constrained to XPath 1.0 then you might be best off using XPath to find the "start" and "end" nodes, and then using the host language to find all the nodes in between.
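To illustrate the host-language route, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML like the sample in the question (real HTML pages usually need a tolerant parser first), and it reads the requirement as "stop at the first paragraph after the start that has a <b> whose text is exactly PRIME MINISTER". The helper names norm, paragraphs_between and text_items are made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in a root element (an assumption of this sketch).
SAMPLE = """<html>
<p><b> <span> Article 1. </span> </b>
<span>
To approve the master plan on development
of tourism in Northern Central Vietnam
with the following principal contents:
</span></p>
<p><span>
1. Development viewpoints
</span></p>
<p><span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
</span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span>
<b> PRIME MINISTER </b>
</span></p>
<p><b> <span> Article 2. </span> </b>
<span>.................</span></p>
<p><span> PRIME MINISTER: Nguyen Tan Dung</span></p>
</html>"""


def norm(text):
    """Collapse whitespace runs, similar to XPath normalize-space()."""
    return " ".join(text.split())


def paragraphs_between(root, start_label, end_label):
    """Return the <p> children of root from the first one whose text starts
    with start_label through the first one containing a <b> whose text
    equals end_label. Returns [] if either boundary is missing."""
    selected = []
    started = False
    for p in root.findall("p"):  # single pass over the paragraphs
        if not started:
            if not norm("".join(p.itertext())).startswith(start_label):
                continue
            started = True
        selected.append(p)
        if any(norm("".join(b.itertext())) == end_label for b in p.iter("b")):
            return selected
    return []


def text_items(paragraphs):
    """Flatten paragraphs into a list of non-empty, normalized text chunks."""
    items = []
    for p in paragraphs:
        for chunk in p.itertext():
            chunk = norm(chunk)
            if chunk:
                items.append(chunk)
    return items


root = ET.fromstring(SAMPLE)
result = text_items(paragraphs_between(root, "Article 1.", "PRIME MINISTER"))
for item in result:
    print(item)
```

Because the loop visits each paragraph once, the cost grows linearly with the number of paragraphs, rather than re-running the contains() tests for every candidate node.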
Upvotes: 3
Reputation: 14135
Here is the xpath that matches the exact requirement in the OP.
//span[normalize-space(.)='Article 1.']/ancestor::p|//p[//span[normalize-space(.)='Article 1.']]/following::*[count(following-sibling::p/span/b[normalize-space(.)='PRIME MINISTER'])=1]
Upvotes: 0
Reputation: 24930
This xpath expression:
//p[descendant-or-self::p and (following-sibling::p/descendant::b)]
should get you your expected output, at least on the html code you posted.
Upvotes: 0