Ronald S. Hong

Reputation: 17

XPath to get a tag containing a specific string and all of its following siblings until a tag containing another specific string

I'm very new to XPath. I'm trying to extract some information from a law and regulation website. Right now I just want to:

  1. Find a tag that contains the string "Article 1."
  2. Starting with the tag from (1), get it and all of the content after it, up to and including the tag that contains the string "PRIME MINISTER" inside a <b> tag.
<p>
  <b> <span> Article 1. </span> </b> 
  <span> 
     To approve the master plan on development 
     of tourism in Northern Central Vietnam 
     with the following principal contents: 
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b> 
  <span> 
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>

The expected output is a list similar to:

[ 
'Article 1.' , 
  'To approve the master plan on development of tourism in Northern 
   Central Vietnam with the following principal contents: ',
  '1. Development viewpoints' ,
  'To realize general viewpoints of the strategy for and master plan on 
   development of Vietnam’s tourism through 2020.' ,
  'PRIME MINISTER: Nguyen Tan Dung',
  'PRIME MINISTER'
]

The first item in the list is "Article 1." and the last item is the "PRIME MINISTER" that is inside a <b> tag.

Upvotes: 1

Views: 186

Answers (4)

Dimitre Novatchev

Reputation: 243459

A single, plain XPath 1.0 expression:

 /*/p[starts-with(normalize-space(), 'Article 1.')]
     [1]
    | /*/p[starts-with(normalize-space(), 'Article 1.')]
          [1]/following-sibling::p
             [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
             and
               following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
             and not(starts-with(normalize-space(), 'PRIME MINISTER'))
             ]

When evaluated against this XML document:

<html>
<p>
  <b> <span> Article 1. </span> </b>
  <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b>
  <span>
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>
</html>

it selects exactly the wanted <p> elements.

Verification:

This XSLT transformation evaluates the XPath expression and outputs all nodes selected in this evaluation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:copy-of select=
    "/*/p[starts-with(normalize-space(), 'Article 1.')]
         [1]
        | /*/p[starts-with(normalize-space(), 'Article 1.')]
              [1]/following-sibling::p
                 [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
                 and
                   following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
                 and not(starts-with(normalize-space(), 'PRIME MINISTER'))
                 ]
    "/>
  </xsl:template>
</xsl:stylesheet>

When applied against the same XML document (above), the wanted result is produced:

<p>
   <b>
      <span> Article 1. </span>
   </b>
   <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>
<p>
   <span>
    1. Development viewpoints
  </span>
</p>
<p>
   <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

and it is displayed by the browser as intended:

Article 1. To approve the master plan on development of tourism in Northern Central Vietnam with the following principal contents:

1. Development viewpoints

To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.

Upvotes: 0

Michael Kay

Reputation: 163262

"Until" and "Between" queries are surprisingly difficult in XPath, even with later XPath versions than 1.0.

If we work back from later versions, in XPath 3.1 you can do something like this:

let $first := p[contains(., 'Article 1')],
    $last := p[contains(., 'PRIME MINISTER')]
return ($first, p[. >> $first and . << $last], $last)

XPath 2.0 has no let, but for works just as well; it just reads a bit oddly.

But in 1.0 (a) we can't bind variables, and (b) we don't have the << and >> operators, which makes it much more difficult.

The simplest expression is probably

p[(.|preceding-sibling::p)[contains(., 'Article 1')] and 
  (.|following-sibling::p)[contains(., 'PRIME MINISTER')]]

Unfortunately, without an incredibly smart optimizer, that's likely to be horrendously inefficient with a large input document (both the contains() tests will be executed around (N^2)/2 times where N is the number of paragraphs). If you're constrained to XPath 1.0 then you might be best off using XPath to find the "start" and "end" nodes, and then using the host language to find all the nodes in between.
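
As a rough illustration of that last suggestion, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It assumes the page has already been turned into well-formed XML with the <p> elements as children of a single root element; the function names paragraphs_between and text_chunks and the shortened SAMPLE string are made up for this example. Because ElementTree supports only a small XPath subset, the start and end tests are written in Python here rather than in XPath, and the end test follows the question's wording (a <b> whose text is "PRIME MINISTER"). The scan visits each paragraph once, so the work grows linearly with the number of paragraphs.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the markup in the question (assumption).
SAMPLE = """<html>
<p><b> <span> Article 1. </span> </b>
  <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
  </span></p>
<p><span> 1. Development viewpoints </span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span><b> PRIME MINISTER </b></span></p>
<p><b> <span> Article 2. </span> </b><span> ... </span></p>
<p><span> PRIME MINISTER: Nguyen Tan Dung</span></p>
</html>"""


def norm(text):
    """Collapse whitespace runs, similar to XPath normalize-space()."""
    return " ".join(text.split())


def paragraphs_between(root, start_text, end_text):
    """Return the <p> children of root from the first one whose text starts
    with start_text through the first later one that has a <b> descendant
    whose text equals end_text (both inclusive)."""
    selected = []
    for p in root.findall("p"):
        if not selected and not norm("".join(p.itertext())).startswith(start_text):
            continue  # still before the start paragraph
        selected.append(p)
        if any(norm("".join(b.itertext())) == end_text for b in p.iter("b")):
            return selected  # reached the end marker
    return selected  # no end marker found: everything from the start onward


def text_chunks(paragraphs):
    """Flatten paragraphs into their non-empty, whitespace-normalized text nodes."""
    return [norm(t) for p in paragraphs for t in p.itertext() if t.strip()]


root = ET.fromstring(SAMPLE)
result = text_chunks(paragraphs_between(root, "Article 1.", "PRIME MINISTER"))
for item in result:
    print(item)
```

On the shortened sample this prints five strings, from "Article 1." through the bare "PRIME MINISTER". Real-world HTML is often not well-formed XML, so a lenient HTML parser would probably be needed in front of this; that part is left out of the sketch.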

Upvotes: 3

supputuri

Reputation: 14135

Here is an XPath that matches the exact requirement in the OP.

//span[normalize-space(.)='Article 1.']/ancestor::p|//p[//span[normalize-space(.)='Article 1.']]/following::*[count(following-sibling::p/span/b[normalize-space(.)='PRIME MINISTER'])=1]

Upvotes: 0

Jack Fleeting

Reputation: 24930

This XPath expression:

//p[descendant-or-self::p and (following-sibling::p/descendant::b)]

should give you the expected output, at least on the HTML you posted.

Upvotes: 0
