Reputation: 13
I'm new to XPath. I'm trying to parse some data in Python using XPath, starting from the following HTML:
<table>
<tr>
<td class="DT">29-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="SomeClass">Some other text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="SomeOtherClass">Some more text</td>
</tr>
<tr>
<td class="DT">22-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsAm">more text</td>
</tr>
<tr>
<td class="DT">30-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsBr">Some other Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsBr">More Text</td>
</tr>
<tr>
<td></td>
<td></td>
<td class="OmsBr">Some different text</td>
</tr>
</table>
I need all the <td> cells in the <tr> siblings that follow a <tr> with values in its <td>s, up to (but not including) the next <tr> with values in all of its <td>s.
E.g. assuming my current position is the first <tr>, I would need these table cells:
<td class="SomeClass">Some other text</td>
<td class="SomeOtherClass">Some more text</td>
Assuming my current position is table row 4:
<tr>
<td class="DT">22-04-14</td>
<td class="Regio">Text</td>
<td class="Md">Text</td>
</tr>
I would only need
<td class="OmsAm">more text</td>
This is the XPath I'm using to get the sibling <tr>s, but it returns all following siblings instead of stopping at the next row that has values: ./following-sibling::tr/td[1][not(text()[1])]/..
I think I have to use the Kayessian method, but I don't understand how to apply it in my case. Any help would be really appreciated!
Upvotes: 1
Views: 1723
Reputation: 20748
I may be misinterpreting the question, but if, for each <tr><td class="DT">xx-xx-xx</td>, you want all <tr> after it and before the next <tr><td class="DT">xx-xx-xx</td>, one pattern is to loop over these "boundary" <tr><td class="DT">xx-xx-xx</td> elements and select the following sibling rows with a condition on how many "boundaries" precede them.
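To make the counting idea concrete before writing any XPath, here is a minimal plain-Python sketch. It assumes the table has already been flattened into a list of `(is_boundary, label)` pairs; that list and the helper name `rows_in_group` are hypothetical and only serve the illustration.

```python
# Hypothetical flattened view of the sample table:
# one (is_boundary, label) pair per <tr>, in document order.
ROWS = [
    (True, "29-04-14"),
    (False, "Some other text"),
    (False, "Some more text"),
    (True, "22-04-14"),
    (False, "more text"),
    (True, "30-04-14"),
    (False, "Some other Text"),
    (False, "More Text"),
    (False, "Some different text"),
]


def rows_in_group(rows, n):
    """Return labels of non-boundary rows preceded by exactly n boundary rows."""
    selected = []
    boundaries_seen = 0
    for is_boundary, label in rows:
        if is_boundary:
            # A boundary row opens a new group and is never selected itself.
            boundaries_seen += 1
        elif boundaries_seen == n:
            selected.append(label)
    return selected


for n in (1, 2, 3):
    print(n, rows_in_group(ROWS, n))
```

The XPath version follows the same two conditions: the number of preceding boundary rows must equal the loop counter, and the row itself must not be a boundary.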
Let's use lxml to illustrate. First, we create a document from your sample input:
>>> import lxml.html
>>> t = '''<table>
... <tr>
... <td class="DT">29-04-14</td>
... <td class="Regio">Text</td>
... <td class="Md">Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="SomeClass">Some other text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="SomeOtherClass">Some more text</td>
... </tr>
... <tr>
... <td class="DT">22-04-14</td>
... <td class="Regio">Text</td>
... <td class="Md">Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsAm">more text</td>
... </tr>
... <tr>
... <td class="DT">30-04-14</td>
... <td class="Regio">Text</td>
... <td class="Md">Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsBr">Some other Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsBr">More Text</td>
... </tr>
... <tr>
... <td></td>
... <td></td>
... <td class="OmsBr">Some different text</td>
... </tr>
... </table>'''
>>> doc = lxml.html.fromstring(t)
Now, let's count these <tr><td class="DT">xx-xx-xx</td> rows:
>>> doc.xpath('//table/tr[td/@class="DT"]')
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00638>]
>>> doc.xpath('count(//table/tr[td/@class="DT"])')
3.0
>>> list(enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1))
[(1, <Element tr at 0x7f948ab00548>), (2, <Element tr at 0x7f948ab005e8>), (3, <Element tr at 0x7f948ab00638>)]
We can loop over these rows and select the rows that come after them in the document (we'll select text nodes to "see" which rows these are):
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print(row.xpath('./following-sibling::tr/td/text()'))
...
['Some other text', 'Some more text', '22-04-14', 'Text', 'Text', 'more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['more text', '30-04-14', 'Text', 'Text', 'Some other Text', 'More Text', 'Some different text']
['Some other Text', 'More Text', 'Some different text']
We're selecting too many rows in each iteration: all the rows until the end of the <table>. We need an additional "end" condition for the following rows.
We're counting the tr[td/@class="DT"] rows in the loop, so we can check how many preceding tr[td/@class="DT"] each following row has:
For the 1st set:
row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=1]')
For the 2nd:
row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=2]')
etc.
So, in the loop, we can pass the current count as an XPath variable (an underrated XPath feature supported by lxml):
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print(row.xpath('./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]', count=cnt))
...
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ec02f98>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab00638>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00688>]
>>>
Hm, we're selecting 1 row too many in the first two iterations.
That's because the next boundary row is matched as well: in the first iteration, <tr><td class="DT">22-04-14</td> also has 1 preceding <tr><td class="DT">.
We can add an extra predicate to select only rows that do NOT have a <td class="DT">:
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print(row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...                                [not(td/@class="DT")]''', count=cnt))
...
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>]
[<Element tr at 0x7f948ab00548>]
[<Element tr at 0x7f948ab00548>, <Element tr at 0x7f948ab005e8>, <Element tr at 0x7f948ab00688>]
>>>
The number of results per iteration looks right. Let's finally check using text nodes:
>>> for cnt, row in enumerate(doc.xpath('//table/tr[td/@class="DT"]'), start=1):
...     print(row.xpath('''
...         ./following-sibling::tr[count(./preceding-sibling::tr[td/@class="DT"])=$count]
...                                [not(td/@class="DT")]
...         /td/text()''', count=cnt))
...
['Some other text', 'Some more text']
['more text']
['Some other Text', 'More Text', 'Some different text']
>>>
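If you only have a single boundary row at hand (your "current position") rather than a loop counter, one possible variation is to let XPath compute the row's own index first. The sketch below assumes lxml is installed and uses a shortened copy of the sample table; the helper name `detail_texts` is hypothetical.

```python
import lxml.html

# Shortened copy of the sample table (first two groups only), for brevity.
HTML = """<table>
<tr><td class="DT">29-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="SomeClass">Some other text</td></tr>
<tr><td></td><td></td><td class="SomeOtherClass">Some more text</td></tr>
<tr><td class="DT">22-04-14</td><td class="Regio">Text</td><td class="Md">Text</td></tr>
<tr><td></td><td></td><td class="OmsAm">more text</td></tr>
</table>"""


def detail_texts(row):
    """Cell texts of the rows between `row` and the next boundary row."""
    # Index of `row` among the boundary rows: preceding boundaries plus itself.
    n = row.xpath('count(preceding-sibling::tr[td/@class="DT"])') + 1
    return row.xpath(
        './following-sibling::tr'
        '[count(preceding-sibling::tr[td/@class="DT"])=$n]'
        '[not(td/@class="DT")]'
        '/td/text()',
        n=n,
    )


doc = lxml.html.fromstring(HTML)
# Map each boundary date to the cell texts of its detail rows.
groups = {
    row.xpath('string(td[@class="DT"])'): detail_texts(row)
    for row in doc.xpath('//table/tr[td/@class="DT"]')
}
print(groups)
```

This keeps the same two predicates as above; the only difference is that the count comes from the row itself instead of from `enumerate`.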
Upvotes: 1