Reputation: 4950
I have read in a file and created a tree using
from lxml import html
my_tree = html.fromstring(html_string)
Then I separated all of the tables
tables = my_tree.xpath('//table')
I have been playing with the tables because I am trying to develop an approach to find the closest match in the document to a model table. I was listing all of the attributes I could consider and thought about comparing the number of rows in each table to the number of rows in my test table.
So I did
table_lens = [len(table.xpath('//tr')) for table in tables]
The interesting thing is that all values in my table_lens list are the same.
I think that the value is the total number of tr in the document (it seems roughly correct)
I expected to have a unique value corresponding to the number of rows in each table.
Now this is interesting because I also 'looked' at the tr elements for two tables by
for tr in tables[20].xpath('//tr'):
    tr
A cursory inspection shows that the tr elements dumped for each table reference the same memory locations, so I then did
tables[20].xpath('//tr') == tables[50].xpath('//tr')
and the interpreter returned
True
So this is fascinating - I thought I would be working with just the rows that belong to a particular table but instead I am getting all of the rows in all of the tables.
On top of all of this I should note that tables[index].text_content() is unique for each tables[index].
To confirm that each table in tables is unique I also did this
>>> tables[20]
<Element table at 0x3260e60>
>>> tables[50]
<Element table at 0x3273570>
>>>
Abarnert's comment below suggested the behavior is due to something about the file. That is an interesting possibility, but after the comment was posted I tried a second file and got the same results. Here is an example htm file:
http://www.sec.gov/Archives/edgar/data/22252/000119312512253074/d360877ddef14a.htm
In this second example there are 33 unique tables and each one reports 173 tr elements.
Upvotes: 0
Views: 145
Reputation: 366133
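In XPath, //tr is an absolute path: it selects all tr nodes from the top of the document, no matter which element you call it on. tr is a relative path: it selects the tr nodes that are direct children of the current node. (If you want every tr anywhere under the current node, for example when the rows sit inside a tbody, use .//tr instead.) It's just like using /foo instead of foo in a filename.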
So, just do this:
table_lens = [len(table.xpath('tr')) for table in tables]
And you'll get a variety of different numbers from 1 to 14 (or maybe more, I didn't look at the whole list).
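To make the difference concrete, here is a minimal sketch on a made-up two-table document (the markup and the variable names absolute, children and descendants are mine, not from the question's file). It assumes lxml is installed and compares the three path forms on each table:

```python
from lxml import html

# Hypothetical sample document, invented for illustration only.
# The second table wraps its rows in an explicit <tbody>.
doc = """
<html><body>
<table><tr><td>a</td></tr><tr><td>b</td></tr></table>
<table><tbody><tr><td>c</td></tr><tr><td>d</td></tr><tr><td>e</td></tr></tbody></table>
</body></html>
"""

tree = html.fromstring(doc)
tables = tree.xpath('//table')

# '//tr' starts at the document root, so every table reports the same total.
absolute = [len(t.xpath('//tr')) for t in tables]

# 'tr' matches only direct children of each table.
children = [len(t.xpath('tr')) for t in tables]

# './/tr' matches all tr descendants of each table.
descendants = [len(t.xpath('.//tr')) for t in tables]

print(absolute, children, descendants)
```

Note that 'tr' finds nothing in the second table because its rows are children of tbody, not of table, while './/tr' would also count the rows of any nested tables. Which form is right depends on what you want "rows in this table" to mean.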
Upvotes: 1