Gino

Reputation: 13

Xpath following siblings until another sibling

I'm new to XPath and am trying to parse some data in Python with it.

Parsing the following HTML:

<table>
    <tr>
        <td class="DT">29-04-14</td>
        <td class="Regio">Text</td>
        <td class="Md">Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td class="SomeClass">Some other text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td class="SomeOtherClass">Some more text</td>
    </tr>
    <tr>
        <td class="DT">22-04-14</td>
        <td class="Regio">Text</td>
        <td class="Md">Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td class="OmsAm">more text</td>
    </tr>
    <tr>
        <td class="DT">30-04-14</td>
        <td class="Regio">Text</td>
        <td class="Md">Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td class="OmsBr">Some other Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td class="OmsBr">More Text</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td class="OmsBr">Some different text</td>
    </tr>
</table>

I need all <td> elements in the sibling <tr> rows that follow a <tr> whose <td>s all have values, up to (but not including) the next <tr> whose <td>s all have values.

E.g. assuming my current position is the first <tr>, I would need these table cells:

    <td class="SomeClass">Some other text</td>
    <td class="SomeOtherClass">Some more text</td>

Assuming my current position is table row 4:

<tr>
    <td class="DT">22-04-14</td>
    <td class="Regio">Text</td>
    <td class="Md">Text</td>
</tr>

I would only need

    <td class="OmsAm">more text</td>

This is the XPath I'm using to get the sibling <tr> rows, but it returns all following siblings instead of stopping at the sibling where it should stop: ./following-sibling::tr/td[1][not(text()[1])]/..

I think I have to use the Kayessian method, but I don't understand how to apply it in my case. Any help would be really appreciated!

Upvotes: 1

Views: 1723

Answers (1)

paul trmbrth

Reputation: 20748

I may be misinterpreting the question, but if, for each <tr><td class="DT">xx-xx-xx</td>, you want all the <tr> after it and before the next <tr><td class="DT">xx-xx-xx</td>, one pattern is to loop over these "boundary" <tr><td class="DT">xx-xx-xx</td> rows and, for each one, select the following sibling rows with a condition on how many "boundary" rows precede them.

Let's use lxml to illustrate. First, we create a document from your sample input:

>>> import lxml.html
>>> t = '''<table>
...     <tr>
...         <td class="DT">29-04-14</td>
...         <td class="Regio">Text</td>
...         <td class="Md">Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td class="SomeClass">Some other text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td class="SomeOtherClass">Some more text</td>
...     </tr>
...     <tr>
...         <td class="DT">22-04-14</td>
...         <td class="Regio">Text</td>
...         <td class="Md">Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td class="OmsAm">more text</td>
...     </tr>
...     <tr>
...         <td class="DT">30-04-14</td>
...         <td class="Regio">Text</td>
...         <td class="Md">Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td class="OmsBr">Some other Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td class="OmsBr">More Text</td>
...     </tr>
...     <tr>
...         <td></td>
...         <td></td>
...         <td class="OmsBr">Some different text</td>
...     </tr>
... </table>'''
>>> doc = lxml.html.fromstring(t)

Now, let's count these <tr><td class="DT">xx-xx-xx</td>:

>>> doc.xpath('//table/tr[td/@class="DT"]')
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00638>]
>>> doc.xpath('count(//table/tr[td/@class="DT"])')
3.0
>>> list(enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1))
[(1, <Element tr at 0x7f948ab00548>), (2, <Element tr at 0x7f948ab005e8>), (3, <Element tr at 0x7f948ab00638>)]

We can loop over these rows and select the rows that come after them in the document (we'll select text nodes to "see" which rows these are):

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('./following-sibling::tr/td/text()') )
... 
['Some other text', 'Some more text', '22-04-14', 'Text', 'Text', 'more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['Some other Text', 'More Text', 'Some different text']

We're selecting too many rows in each iteration: every row until the end of the <table>. We need an additional "end" condition for the following rows.
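
As an aside, the same "end" condition can be expressed in plain Python rather than in XPath. This is only a minimal sketch, assuming lxml's itersiblings() together with itertools.takewhile; the names html, table, is_boundary, rows and texts are made up for illustration, and the table is a compact copy of the sample input.

```python
from itertools import takewhile

import lxml.html

# Compact copy of the sample table from the question (same rows, same order).
html = '''<table>
<tr><td class="DT">29-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="SomeClass">Some other text</td></tr>
<tr><td></td><td></td><td class="SomeOtherClass">Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsAm">more text</td></tr>
<tr><td class="DT">30-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some other Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">More Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some different text</td></tr>
</table>'''
table = lxml.html.fromstring(html)


def is_boundary(tr):
    """Return True if the row contains a <td class="DT"> cell."""
    return bool(tr.xpath('boolean(td/@class="DT")'))


# Start from the first boundary row.
first = table.xpath('//table/tr[td/@class="DT"]')[0]

# itersiblings('tr') walks the following <tr> siblings in document order;
# takewhile() stops at the first sibling that is itself a boundary row.
rows = list(takewhile(lambda tr: not is_boundary(tr), first.itersiblings('tr')))

# String value of the last cell of each selected row.
texts = [tr.xpath('string(td[last()])') for tr in rows]
print(texts)
```

The rest of this answer keeps the stop condition inside the XPath expression instead.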

We're counting the tr[td/@class="DT"] in the loop, so we can check how many preceding tr[td/@class="DT"] each row has:

For the 1st set:

row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=1]')

For the 2nd:

row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=2]')

etc.

So, in the loop, we can pass the current count to the expression as an XPath variable (an underrated XPath feature supported by lxml):

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]', count=cnt) )
... 
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ec02f98>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab00638>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00688>]
>>> 

Hm, we're selecting one row too many in the first two iterations.

That's because the next boundary row matches the count as well: <tr><td class="DT">22-04-14</td>, for example, also has exactly 1 preceding <tr><td class="DT">.
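
To see which rows share a count, one can list, for every <tr>, how many boundary rows precede it and whether it is a boundary row itself. A small sketch under the same assumptions as the rest of this answer (lxml, the sample table in compact form); html, table and summary are illustrative names only.

```python
import lxml.html

# Compact copy of the sample table from the question (same rows, same order).
html = '''<table>
<tr><td class="DT">29-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="SomeClass">Some other text</td></tr>
<tr><td></td><td></td><td class="SomeOtherClass">Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsAm">more text</td></tr>
<tr><td class="DT">30-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some other Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">More Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some different text</td></tr>
</table>'''
table = lxml.html.fromstring(html)

summary = []
for tr in table.xpath('//table/tr'):
    # Number of boundary rows that come before this row.
    n_before = int(tr.xpath('count(preceding-sibling::tr[td/@class="DT"])'))
    # Whether this row is a boundary row itself.
    boundary = bool(tr.xpath('boolean(td/@class="DT")'))
    summary.append((n_before, boundary))
    print(n_before, boundary, tr.xpath('string(td[last()])'))
```

Rows with the same count form one group, and each group ends with a boundary row that has the same count as the detail rows before it.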

We can add an extra predicate to select only rows that do NOT have a <td class="DT">:

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...                                [not(td/@class="DT")]''', count=cnt) )
... 
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>]
[<Element tr at 0x7f948ab00548>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00688>]
>>> 

The number of results per iteration looks right. Let's finally check using text nodes:

>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print( row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...                                [not(td/@class="DT")]
...             /td/text()''', count=cnt) )
... 
['Some other text', 'Some more text']
['more text']
['Some other Text', 'More Text', 'Some different text']
>>> 
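
The question mentions the Kayessian method, so here is a hedged sketch of how that node-set intersection formula, ns1[count(. | ns2) = count(ns2)], might be applied to the same table. It assumes ns1 is "the following siblings of the current boundary row" and ns2 is "the rows preceding the next boundary row". Both node sets are written inline, with the next boundary row addressed by its position through a numeric XPath variable. The names html, table, INTERSECTION, nxt and groups are illustrative, not part of any API.

```python
import lxml.html

# Compact copy of the sample table from the question (same rows, same order).
html = '''<table>
<tr><td class="DT">29-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="SomeClass">Some other text</td></tr>
<tr><td></td><td></td><td class="SomeOtherClass">Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsAm">more text</td></tr>
<tr><td class="DT">30-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some other Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">More Text</td></tr>
<tr><td></td><td></td><td class="OmsBr">Some different text</td></tr>
</table>'''
table = lxml.html.fromstring(html)

# ns1: following siblings of the current row.
# ns2: rows preceding the $nxt-th boundary row.
# A row from ns1 is kept only if adding it to ns2 does not grow ns2,
# i.e. if it is already a member of ns2.
INTERSECTION = '''
    following-sibling::tr[
        count(. | (//table/tr[td/@class="DT"])[$nxt]/preceding-sibling::tr)
        = count((//table/tr[td/@class="DT"])[$nxt]/preceding-sibling::tr)
    ]'''

boundaries = table.xpath('//table/tr[td/@class="DT"]')
groups = []
for cnt, row in enumerate(boundaries, start=1):
    if cnt < len(boundaries):
        rows = row.xpath(INTERSECTION, nxt=cnt + 1)
    else:
        # No next boundary: ns2 would be empty, so take every remaining row.
        rows = row.xpath('following-sibling::tr')
    groups.append([tr.xpath('string(td[last()])') for tr in rows])

print(groups)
```

With this formulation the next boundary row is excluded automatically, because ns2 only contains the rows before it. The last group needs the explicit fallback, since an empty ns2 makes the intersection empty.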

Upvotes: 1
