Reputation: 503
I've got a HTML-file which looks like this (simplified):
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
What I'd like to extract is the content of "table class="main"", so in explicit words, I'd like to extract the same as it is written above to a file. Consider: The example is simplified; around the -tags, there are many others... I tried to extract the content using the following code:
root = lxml.html.parse('www.test.xyz').getroot()
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
empty.getparent().remove(empty)
tables = root.cssselect('table.main')
The above code works. But the problem is that I got a part twice; see what I mean: The result of the code is:
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
So the problem is that the middle part appears one time too much at the end. Why is this and how can this be omitted and fixed?
paul t., also a stackoverflow-user, told me to use "root.xpath('//table[@class="main" and not(.//table[@class="main"])]')". This code prints out exactly the part I have twice.
I hope the problem is described clearly enough...thanks for any help and any propositions :)
Upvotes: 0
Views: 161
Reputation: 28266
You want to select all the tables with class "main" which are not already selected as descendants of the same elements.
This seems to work fine:
root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
Upvotes: 1