Dominik Filipiak

Reputation: 1272

XPath - extracting text between two nodes

I'm having a problem with an XPath query. I need to parse a div that is divided into an unknown number of "sections". Each section starts with an h5 containing the section name. The list of possible section titles is known, and each title can occur at most once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".

HTML

<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>

Expected result (for SecondHeader)

['text2a', 'text2b']

Query #1

//text()[following-sibling::h5/text()='ThirdHeader']

Result #1

['text1', 'text2a', 'text2b']

That's obviously a bit too much, so I decided to restrict the result to the content between the selected header and the header before it.

Query #2

//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']

Result #2

['text2a', 'text2b']

The results meet expectations. However, I can't use this query, because I don't know whether SecondHeader or ThirdHeader will exist in the parsed page. The query must use only one section title.

Query #3

//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]

Result #3

[]

Could you please tell me what I am doing wrong? I've tested it in Google Chrome.

Upvotes: 2

Views: 2768

Answers (2)

Daniel Haley

Reputation: 52888

You should be able to just test the first preceding sibling h5...

//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
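For anyone who wants to try this outside the browser, here is a minimal sketch in Python with lxml, using the HTML from the question. The helper name `section_text` and the XPath variable `$name` are my own additions for illustration, and the whitespace stripping is an assumption about the output you want. Because `preceding-sibling` is a reverse axis, `h5[1]` is the nearest h5 before each text node, so only one section title is needed.

```python
import lxml.html

HTML = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>"""


def section_text(doc, header):
    """Return the stripped, non-empty text nodes under the given h5 title.

    `section_text` is a hypothetical helper, not part of lxml.
    """
    nodes = doc.xpath(
        # Keep text nodes whose nearest preceding h5 sibling has this title.
        "//text()[preceding-sibling::h5[1][normalize-space()=$name]]",
        name=header,
    )
    # Drop whitespace-only nodes (e.g. the one after a trailing <br>).
    return [t.strip() for t in nodes if t.strip()]


doc = lxml.html.fromstring(HTML)
print(section_text(doc, "SecondHeader"))   # ['text2a', 'text2b']
print(section_text(doc, "ThirdHeader"))    # ['text3a', 'text3b', 'text3c']
print(section_text(doc, "MissingHeader"))  # []
```

A title that is not on the page simply yields an empty list, which should cover the case where you don't know which sections exist.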

Upvotes: 1

paul trmbrth

Reputation: 20748

If all h5 elements and text nodes are siblings, and you need to group by section, one option is simply to select text nodes by the number of h5 elements that precede them.

Example using lxml (in Python)

>>> import lxml.html
>>> s = '''
... <div class="some-class">
...  <h5>FirstHeader</h5>
...   text1
...  <h5>SecondHeader</h5>
...   text2a<br>
...   text2b
...  <h5>ThirdHeader</h5>
...   text3a<br>
...   text3b<br>
...   text3c<br>
...  <h5>FourthHeader</h5>
...   text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n  text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n  text2a', '\n  text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n  text3a', '\n  text3b', '\n  text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n  text4\n']
>>> 
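Building on the session above, the counting query could be wrapped in a loop to collect every section into a dict keyed by header title. This is only a sketch: `sections_by_header` is a hypothetical helper, and it assumes that every h5 in the document is a sibling inside the same div, as in the sample markup.

```python
import lxml.html

HTML = """
<div class="some-class">
 <h5>FirstHeader</h5>
  text1
 <h5>SecondHeader</h5>
  text2a<br>
  text2b
 <h5>ThirdHeader</h5>
  text3a<br>
  text3b<br>
  text3c<br>
 <h5>FourthHeader</h5>
  text4
</div>"""


def sections_by_header(doc):
    """Map each h5 title to the stripped text nodes that follow it.

    Hypothetical helper; assumes all h5 elements are siblings.
    """
    sections = {}
    # The n-th h5 (1-based) owns the text nodes with exactly n h5 before them.
    for position, h5 in enumerate(doc.xpath("//h5"), start=1):
        nodes = doc.xpath(
            "//text()[count(preceding-sibling::h5)=$count]",
            count=position,
        )
        title = h5.text_content().strip()
        # Drop whitespace-only nodes left over from the layout.
        sections[title] = [t.strip() for t in nodes if t.strip()]
    return sections


doc = lxml.html.fromstring(HTML)
sections = sections_by_header(doc)
print(sections.get("SecondHeader", []))  # ['text2a', 'text2b']
```

Looking sections up with `dict.get` and a default keeps the code working when a given title is absent from the page.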

Upvotes: 2
