Xpath following siblings until another sibling

Question

I'm new to using Xpath. I'm trying to parse some data in Python using Xpath.

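Parsing the following HTML:

<table>
    <tr>
        <td class="DT">29-04-14</td>
        <td>Text</td>
        <td>Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>Some other text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>Some more text</td>
    </tr>
    <tr>
        <td class="DT">22-04-14</td>
        <td>Text</td>
        <td>Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>more text</td>
    </tr>
    <tr>
        <td class="DT">30-04-14</td>
        <td>Text</td>
        <td>Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>Some other Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>More Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>Some different text</td>
    </tr>
</table>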

I need all the <tr> elements among the following siblings of a <tr> that has values in its <td>s, but only up to the next <tr> that has values in all of its <td>s.

E.g., assuming my current position is the first <tr>, I would need these table cells:

    Some other text
    Some more text

Assuming my current position is table row 4:

<tr>
    <td class="DT">22-04-14</td>
    <td>Text</td>
    <td>Text</td>
</tr>

I would only need

    more text

This is the XPath I'm using to get the sibling <tr>s, but it returns all following siblings instead of stopping at the sibling where it should: ./following-sibling::tr/td[1][not(text()[1])]/..

I think I have to implement the Kayessian method, but I don't understand how to apply it in my case. Any help would be really appreciated!

paul trmbrth · Accepted Answer

I may be misinterpreting the question, but if, for each xx-xx-xx row, you want all the <tr> rows after it and before the next xx-xx-xx row, one pattern is to loop over these "boundary" xx-xx-xx rows and select their following sibling rows with a condition on how many "boundaries" come before them.

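Let's use lxml to illustrate. First, we create a document from your sample input:

>>> import lxml.html
>>> t = '''<table>
...     <tr>
...         <td class="DT">29-04-14</td>
...         <td>Text</td>
...         <td>Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td>Some other text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td>Some more text</td>
...     </tr>
...     <tr>
...         <td class="DT">22-04-14</td>
...         <td>Text</td>
...         <td>Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td>more text</td>
...     </tr>
...     <tr>
...         <td class="DT">30-04-14</td>
...         <td>Text</td>
...         <td>Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td>Some other Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td>More Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td>Some different text</td>
...     </tr>
... </table>'''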
>>> doc = lxml.html.fromstring(t)

Now, let's count these xx-xx-xx rows:

>>> doc.xpath('//table/tr[td/@class="DT"]')
[<Element tr at 0x...>, <Element tr at 0x...>, <Element tr at 0x...>]
>>> doc.xpath('count(//table/tr[td/@class="DT"])')
3.0
>>> list(enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1))
[(1, <Element tr at 0x...>), (2, <Element tr at 0x...>), (3, <Element tr at 0x...>)]

We can loop over these rows and select the rows that come after them in the document (we'll select text nodes to "see" which rows these are):

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('./following-sibling::tr/td/text()') )
... 
['Some other text', 'Some more text', '22-04-14', 'Text', 'Text', 'more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['Some other Text', 'More Text', 'Some different text']

We're selecting too many rows in each iteration: all the rows until the end of the <table>. We need an additional "end" condition for the following rows.

We're counting the tr[td/@class="DT"] in the loop, so we can check how many preceding tr[td/@class="DT"] each row has:

For the 1st set:

row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=1]')

For the 2nd:

row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=2]')

etc.
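
To see why the count can act as an "end" condition, here is a minimal sketch that prints, for every row, how many boundary rows precede it. The SAMPLE markup is an assumption (only the date cell of each boundary row carries class="DT"), and the names SAMPLE and counts are illustrative only.

```python
import lxml.html

# Assumed sample markup: only the date cell of a boundary row has class="DT".
SAMPLE = '''<table>
<tr><td class="DT">29-04-14</td><td>Text</td><td>Text</td></tr>
<tr><td></td><td></td><td>Some other text</td></tr>
<tr><td></td><td></td><td>Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td>Text</td><td>Text</td></tr>
<tr><td></td><td></td><td>more text</td></tr>
<tr><td class="DT">30-04-14</td><td>Text</td><td>Text</td></tr>
<tr><td></td><td></td><td>Some other Text</td></tr>
<tr><td></td><td></td><td>More Text</td></tr>
<tr><td></td><td></td><td>Some different text</td></tr>
</table>'''

doc = lxml.html.fromstring(SAMPLE)

# For each row: (number of preceding boundary rows, is this row a boundary?)
counts = []
for tr in doc.xpath('//table/tr'):
    n = int(tr.xpath('count(./preceding-sibling::tr[td/@class="DT"])'))
    is_boundary = bool(tr.xpath('boolean(td/@class="DT")'))
    counts.append((n, is_boundary))
    print(n, 'boundary' if is_boundary else 'data')
```

Every data row shares its count with the boundary row that closes its group, so the count alone is not quite enough to isolate a group.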

So, in the loop, we can pass the current count to the expression as an XPath variable (an underrated XPath feature supported by lxml):

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]', count=cnt) )
... 
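[<Element tr at 0x...>, <Element tr at 0x...>, <Element tr at 0x...>]
[<Element tr at 0x...>, <Element tr at 0x...>]
[<Element tr at 0x...>, <Element tr at 0x...>, <Element tr at 0x...>]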
>>>

Hm, we're selecting one row too many in the first two iterations.

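That's because the next boundary row matches the count too: the 22-04-14 row also has 1 preceding <tr> with td/@class="DT", and the 30-04-14 row has 2.

We can add an extra predicate to select only rows that do NOT have a <td class="DT"> themselves: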

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...                                [not(td/@class="DT")]''', count=cnt) )
... 
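[<Element tr at 0x...>, <Element tr at 0x...>]
[<Element tr at 0x...>]
[<Element tr at 0x...>, <Element tr at 0x...>, <Element tr at 0x...>]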
>>>

The number of results per iteration looks right. Let's finally check using text nodes:

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...                                [not(td/@class="DT")]
...             /td/text()''', count=cnt) )
... 
['Some other text', 'Some more text']
['more text']
['Some other Text', 'More Text', 'Some different text']
>>>

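For reuse outside the interactive session, the steps above could be wrapped in a small function. This is only a sketch under the same assumption about the markup (class="DT" on the date cell of each boundary row). The names SAMPLE, rows_by_boundary and groups are hypothetical and not part of lxml.

```python
import lxml.html

# Assumed sample markup: only the date cell of a boundary row has class="DT".
SAMPLE = '''<table>
<tr><td class="DT">29-04-14</td><td>Text</td><td>Text</td></tr>
<tr><td></td><td></td><td>Some other text</td></tr>
<tr><td></td><td></td><td>Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td>Text</td><td>Text</td></tr>
<tr><td></td><td></td><td>more text</td></tr>
<tr><td class="DT">30-04-14</td><td>Text</td><td>Text</td></tr>
<tr><td></td><td></td><td>Some other Text</td></tr>
<tr><td></td><td></td><td>More Text</td></tr>
<tr><td></td><td></td><td>Some different text</td></tr>
</table>'''


def rows_by_boundary(table):
    """Map each boundary row's first-cell text to the cell texts of its data rows."""
    groups = {}
    # Boundary rows are taken from one table only, so the n-th boundary
    # row has exactly n-1 boundary rows before it.
    boundaries = table.xpath('./tr[td/@class="DT"]')
    for cnt, row in enumerate(boundaries, start=1):
        texts = row.xpath(
            './following-sibling::tr'
            '[count(./preceding-sibling::tr[td/@class="DT"])=$count]'
            '[not(td/@class="DT")]'
            '/td/text()',
            count=cnt,  # passed to the expression as the XPath variable $count
        )
        groups[row.xpath('string(td[1])')] = texts
    return groups


table = lxml.html.fromstring(SAMPLE)
groups = rows_by_boundary(table)
for date, texts in groups.items():
    print(date, texts)
```

The function takes a single table element on purpose: the enumerate-based count only lines up with the preceding-sibling count when all boundary rows are siblings in the same table.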