How to get all elements between two nodes with XPATH?

Question

I have HTML code like this:



    
        
        
        test
    
    
        Title
        para1
        para2
        para3
        Title
        para4
        para5

What I want is:

para1
para2
para3

So I want to get only the first part of this HTML and ignore the second part.

For now, I have worked it out this way:

#!/usr/bin/env python
# encoding: utf-8

import unittest

from lxml import etree

class SearchPara(unittest.TestCase):

    def setUp(self):
        with open('test.html') as f:
            self.html = f.read()

    def test_parse_html(self):
        paras = ''
        page = etree.HTML(self.html)
        a_ele = page.xpath("//h3/a[@name='title1']/..")

        if a_ele is None or len(a_ele) < 1:
            return paras

        para = a_ele[0].xpath('following-sibling::*[1][name(.) != "h3"]')
        while para is not None and len(para) > 0:
            print(para)
            # encoding='unicode' makes tostring() return str rather than bytes
            paras += etree.tostring(para[0], encoding='unicode')
            para = para[0].xpath('following-sibling::*[1][name(.) != "h3"]')

        print(paras)


    def tearDown(self):
        pass

if __name__ == "__main__":
    unittest.main()

As you can see, this is a little complicated. Is there a better way to do this?

har07 · Accepted Answer

As far as I know, there is no general way to select elements between 2 elements using XPath 1.0.

The same output can still be achieved if we define the condition differently. For example, by selecting <div>s whose nearest preceding sibling <a> has the value "Title: Part I":

//div[preceding-sibling::a[1][. = 'Title: Part I']]

and selecting the next group of <div>s only requires changing the criteria:

//div[preceding-sibling::a[1][. = 'Title: Part II']]

A test method to see the above XPath in action:

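def test_parse_html(self):
    page = etree.HTML(self.html)
    # Select every <div> whose nearest preceding <a> sibling is the Part I title.
    divs = page.xpath("//div[preceding-sibling::a[1][. = 'Title: Part I']]")
    # encoding='unicode' makes tostring() return str rather than bytes.
    paras = ''.join(etree.tostring(div, encoding='unicode') for div in divs)

    print(paras)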
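
To try the expression without the original test.html, here is a minimal self-contained sketch in Python 3. The markup in SAMPLE_HTML is an assumption reconstructed from the XPath above (each title is an <a> followed by sibling <div>s), and select_part is a hypothetical helper name. It passes the title as an XPath variable, which lxml accepts as a keyword argument, so switching groups does not require editing the expression itself.

```python
from lxml import etree

# Assumed markup: each title is an <a>, followed by sibling <div> paragraphs.
SAMPLE_HTML = """
<html><body>
<a name="title1">Title: Part I</a>
<div>para1</div>
<div>para2</div>
<div>para3</div>
<a name="title2">Title: Part II</a>
<div>para4</div>
<div>para5</div>
</body></html>
"""


def select_part(page, title):
    """Return the text of every <div> whose nearest preceding <a> sibling equals title."""
    # $title is bound to the keyword argument of the same name.
    divs = page.xpath("//div[preceding-sibling::a[1][. = $title]]", title=title)
    return [div.text for div in divs]


page = etree.HTML(SAMPLE_HTML)
part_one = select_part(page, "Title: Part I")
part_two = select_part(page, "Title: Part II")
print(part_one)
print(part_two)
```

Because preceding-sibling is a reverse axis, a[1] is the closest <a> before each <div>, which is what keeps the two groups apart.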

Side note: the XPath for populating a_ele in your code can be simplified this way:

a_ele = page.xpath("//a[h3 = 'Title: Part I']")

or even further, since the only text within the <a> is "Title: Part I":

a_ele = page.xpath("//a[. = 'Title: Part I']")
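
As a quick check of the simplified expression, the sketch below runs it against a small assumed document and confirms that it matches exactly one element. The markup and the name attributes on the anchors are assumptions for illustration, not the original test.html.

```python
from lxml import etree

# Assumed markup for illustration: two title anchors among sibling <div>s.
page = etree.HTML(
    '<html><body>'
    '<a name="title1">Title: Part I</a><div>para1</div>'
    '<a name="title2">Title: Part II</a><div>para4</div>'
    '</body></html>'
)

# "." compares the element's whole string value, so the match must be exact.
a_ele = page.xpath("//a[. = 'Title: Part I']")
names = [a.get("name") for a in a_ele]
print(names)
```

If the real markup has whitespace around the title text inside the <a>, comparing with normalize-space(.) = 'Title: Part I' may be more forgiving.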
