Extract content of an HTML file

Question

I've got an HTML file which looks like this (simplified):


Here is some text.

Here is another text which ends right here.

Here are also some words...

What I'd like to extract is the content of <table class="main">; in other words, I'd like to write it to a file exactly as it appears above. Note that the example is simplified: around the <table> tags there are many other tags. I tried to extract the content using the following code:

import lxml.html

root = lxml.html.parse('www.test.xyz').getroot()

# Remove empty <b> and <i> elements
for empty in root.xpath('//*[self::b or self::i][not(node())]'):
    empty.getparent().remove(empty)

# Select every table with class "main"
tables = root.cssselect('table.main')

The above code works, but part of the output appears twice. The result of the code is:


Here is some text.

Here is another text which ends right here.

Here are also some words...


Here is another text which ends right here.

So the problem is that the middle part appears a second time at the end. Why does this happen, and how can it be fixed?

paul t., another Stack Overflow user, suggested using "root.xpath('//table[@class="main" and not(.//table[@class="main"])]')". That code prints exactly the part that I get twice.

I hope the problem is described clearly enough... Thanks for any help and suggestions :)

stranac · Accepted Answer

You want to select all tables with class "main" that are not nested inside another table with class "main". Otherwise a nested table is returned once as part of its parent and once on its own.
This seems to work fine:

root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
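To see the difference, here is a minimal sketch. The markup in SAMPLE is an assumption: the original HTML is not shown, so the sketch simply nests one table.main inside another, which would produce the duplication described in the question. The names SAMPLE, all_tables, outer_tables and serialized are made up for illustration.

```python
import lxml.html

# Hypothetical markup (not the asker's real file): an outer table.main
# that contains a nested table.main.
SAMPLE = """
<html><body>
<table class="main"><tr><td>
  <p>Here is some text.</p>
  <table class="main"><tr><td>
    <p>Here is another text which ends right here.</p>
  </td></tr></table>
  <p>Here are also some words...</p>
</td></tr></table>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE)

# Matches every table with class "main", including the nested one,
# so the nested content would be printed twice.
all_tables = root.xpath('//table[@class="main"]')

# Keeps only tables that have no table.main ancestor.
outer_tables = root.xpath(
    '//table[@class="main" and not(ancestor::table[@class="main"])]'
)

# Serialize each outermost table, nested markup included.
serialized = [lxml.html.tostring(t, encoding="unicode") for t in outer_tables]

for table in outer_tables:
    print(table.text_content())
```

With this sample, the first query finds both tables and the second only the outer one. The nested text still appears in the output, but only once, as part of its parent. Note that @class="main" matches only when the attribute is exactly "main"; a table with several classes would need a different test.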
