Reputation: 5748
How should I do this in the fastest way?
I have a .h5 file with some tables. Each table has around 10 million (or more) rows.
The whole file is around 10 GB, so it does not fit in memory.
The tables are "linked": all of them share the same column (ID), which is used to join them.
Now, if I call my tables table1, table2, table3, table4, etc., I am looking for the fastest way to search table2 using the ID data from table1.
As an example, this is what I have done so far:
# search table1 and get the IDs matching the first condition
searchID = "".join(["(ID==%i)|" % j['ID'] for j in table1.where('some conditions for table1')])[:-1]

# search table2 based on the IDs from table1
for row in table2.where(searchID):
    pass  # do something with row
The problem is that I do not think this is a very efficient solution, and I have noticed that if searchID grows a lot, Spyder just crashes.
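One way to sidestep the ever-growing condition string (a sketch, not tested against your file: `join_by_id` is a hypothetical helper, and the commented PyTables lines assume the table names from the question) is to read the matching IDs from table1 into a Python set once, then scan table2 a single time and test membership:

```python
def join_by_id(matching_ids, rows):
    """Yield rows whose 'ID' field is in matching_ids.

    matching_ids: any iterable of ID values (materialized into a set once).
    rows: an iterable of mappings with an 'ID' key, e.g. the Row objects
    yielded by table2.iterrows() in PyTables.
    """
    wanted = set(matching_ids)  # set gives O(1) membership tests
    for row in rows:
        if row['ID'] in wanted:
            yield row

# With PyTables it would look roughly like this (untested sketch;
# 'some conditions for table1' is the question's placeholder):
#   ids = (r['ID'] for r in table1.where('some conditions for table1'))
#   for row in join_by_id(ids, table2.iterrows()):
#       pass  # do something with row
```

The memory cost is one set of IDs rather than one huge query string that PyTables has to parse, and table2 is read sequentially exactly once.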
Upvotes: 2
Views: 847
Reputation: 3637
There are a couple of things that you could do to potentially make this faster, though there is no silver bullet.
If you can combine all of the tables into one table with more columns, then you wouldn't have to loop through twice.
You could index the tables based on ID. This would improve search performance.
Change the chunkshape of the tables to be more optimal for your problem. Making it smaller should help with the crashes.
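The second and third suggestions map onto PyTables calls roughly like this (a minimal sketch on a toy file standing in for the 10 GB one; the file path, table layout, and chunkshape value are assumptions, while `create_csindex` and the `chunkshape` keyword of `Table.copy` are part of the PyTables API):

```python
import os
import tempfile

import tables

# A tiny stand-in file; with the real 10 GB file you would open it
# in append mode instead of creating it.
path = os.path.join(tempfile.mkdtemp(), "tuned.h5")

with tables.open_file(path, mode="w") as h5:
    table2 = h5.create_table("/", "table2", {"ID": tables.Int64Col()})
    row = table2.row
    for i in range(1000):
        row["ID"] = i
        row.append()
    table2.flush()

    # 1) Index the ID column: queries such as table2.where('ID == 42')
    #    can then use the index instead of scanning every row.
    table2.cols.ID.create_csindex()

    # 2) chunkshape is fixed when a table is created, so to shrink it
    #    you copy the table, passing the new chunkshape.
    table2.copy(newname="table2_rechunked", chunkshape=(256,))
```

Indexing speeds up repeated `where()` lookups on ID at the cost of some extra disk space, while a smaller chunkshape reduces how much data each read pulls into memory.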
Upvotes: 1