MarkF6

Reputation: 503

Extract content of an HTML file

I've got an HTML file that looks like this (simplified):

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>

What I'd like to extract is the content of the table with class="main"; in other words, I'd like to write exactly the markup shown above to a file. Note that the example is simplified: around the <table> tags there are many other tags. I tried to extract the content using the following code:

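import lxml.html  # cssselect() below also needs the separate cssselect package

root = lxml.html.parse('www.test.xyz').getroot()

# Remove empty <b> and <i> elements
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# Select every table with class "main", including nested ones
tables = root.cssselect('table.main')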

The above code works, but part of the output appears twice. The result of the code is:

<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
Here is some text.
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>
Here are also some words...
</table>
<table class="main" frame="box" rules="all" border="0" cellspacing="0" cellpadding="0" style=" width:50.88pt; height:77.28pt;">
Here is another text which ends right here.
</table>

So the problem is that the inner table appears a second time at the end. Why does this happen, and how can it be avoided?

paul t., another Stack Overflow user, suggested using "root.xpath('//table[@class="main" and not(.//table[@class="main"])]')". But that expression prints exactly the part I get twice, i.e. only the inner table.

I hope the problem is described clearly enough. Thanks for any help and suggestions :)

Upvotes: 0

Views: 161

Answers (1)

stranac

Reputation: 28266

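You want to select all tables with class "main" that are not already included as descendants of another such table. cssselect('table.main') matches the inner table as well, and because serializing the outer table already includes the inner one, the inner table shows up twice.
This seems to work fine: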

root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
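
To see the difference between the selectors, here is a minimal sketch. It assumes a small inline sample instead of the real page: the SAMPLE string, the <tr>/<td> wrappers, and the variable names are made up for illustration and are not from the original file.

```python
import lxml.html

# Hypothetical sample that mirrors the simplified markup from the question.
# The <tr>/<td> wrappers are an assumption so that the nesting is valid HTML.
SAMPLE = """
<html><body>
<table class="main" style="table-layout:fixed; width:325.68pt; height:528.96pt;">
<tr><td>Here is some text.
<table class="main" style="width:50.88pt; height:77.28pt;">
<tr><td>Here is another text which ends right here.</td></tr>
</table>
Here are also some words...</td></tr>
</table>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)

# Every table with class "main", nested ones included (this is the
# selection that leads to the duplicated output).
all_main = root.xpath('//table[@class="main"]')

# Only tables that have no "main" table above them: the outermost ones.
outermost = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

# Only tables that contain no further "main" table: the innermost ones.
innermost = root.xpath(
    '//table[@class="main" and not(.//table[@class="main"])]'
)

print(len(all_main), len(outermost), len(innermost))

# Serializing the outermost tables already includes everything nested inside.
serialized = "".join(
    lxml.html.tostring(table, encoding="unicode") for table in outermost
)
print(serialized)
```

With the ancestor test, each nested table is written only as part of its outermost parent, so nothing is repeated. Note that @class="main" is an exact string comparison; if the real tables carry several classes, the predicate would need adjusting.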

Upvotes: 1
