How can I grab the rows from a specific table using an XPATH expression - mine seems to grab all the rows in the document

Question

I have read in a file and created a tree using

from lxml import html
my_tree = html.fromstring(html_string)

Then I separated all of the tables

tables = my_tree.xpath('//table')

Now that I have been playing with the tables, I am trying to develop an approach to find the closest match in the document to a model table. I was listing the attributes I could consider, and thought about comparing the number of rows in each table to the number of rows in my test table.

So I did

table_lens = [len(table.xpath('//tr')) for table in tables]

The interesting thing is that all values in my table_lens list are the same.

I think that the value is the total number of tr in the document (it seems roughly correct)

I expected to have a unique value corresponding to the number of rows in each table.

Now this is interesting because I also 'looked' at the tr elements for two tables by

for tr in tables[20].xpath('//tr'):
    tr

A cursory inspection shows that the tr elements printed for both tables reference the same memory locations, so I then did

tables[20].xpath('//tr') == tables[50].xpath('//tr')

and the interpreter returned

True

So this is fascinating - I thought I would be working with just the rows that belong to a particular table but instead I am getting all of the rows in all of the tables.

On top of all of this, I should note that tables[index].text_content() is unique for each index.

To confirm that each table in tables is unique, I also did this

>>> tables[20]

>>> tables[50]

>>>

Abarnert's comment below suggested the behavior is due to something about the file. That is an interesting possibility, but after the comment was posted I tried a second file and got the same results. Here is an example htm file

http://www.sec.gov/Archives/edgar/data/22252/000119312512253074/d360877ddef14a.htm

In this second example there are 33 unique tables, and each one reports 173 tr elements.

abarnert · Accepted Answer

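In XPath, //tr is an absolute path: it selects all tr nodes from the top of the document, no matter which element you call xpath() on. tr is a relative path: it selects the tr nodes that are direct children of the current node. (If the rows are wrapped in tbody, use .//tr to select all tr descendants of the current node instead.) It's just like using /foo instead of foo in a filename.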

So, just do this:

table_lens = [len(table.xpath('tr')) for table in tables]

And you'll get a variety of different numbers from 1 to 14 (or maybe more, I didn't look at the whole list).
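To make the difference concrete, here is a minimal sketch on a made-up two-table document (the markup and the variable names are mine, not taken from the SEC filing). It assumes lxml is installed and compares the absolute //tr, the child-only tr, and the descendant form .//tr:

```python
from lxml import html

# Hypothetical sample document: the second table wraps its row in <tbody>.
doc = html.fromstring("""
<html><body>
  <table id="a"><tr><td>1</td></tr><tr><td>2</td></tr></table>
  <table id="b"><tbody><tr><td>3</td></tr></tbody></table>
</body></html>
""")

tables = doc.xpath('//table')

# Absolute path: evaluated from the document root, so every table
# reports the total number of rows in the whole document.
absolute = [len(t.xpath('//tr')) for t in tables]

# Relative path, child axis: only <tr> elements directly under <table>.
children = [len(t.xpath('tr')) for t in tables]

# Relative path, descendant axis: <tr> elements at any depth below <table>.
descendants = [len(t.xpath('.//tr')) for t in tables]

print(absolute, children, descendants)
```

Note that .//tr also counts rows of any tables nested inside the current one, so which relative form is right depends on the markup you are dealing with.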
