Reputation: 51
I am trying to read a 64 GB HDF file (compressed with blosc) with the pandas library. The DataFrame contains 3 columns and 11410996915 rows. I am trying to select specific rows by index with the method pandas.read_hdf and the where argument. The problem is that sometimes I need to get thousands of rows back, so my where parameter looks something like this:
simMat = pd.read_hdf('global.h5','matrix', columns=['similarity'], where='index in {}'.format(listIdx))
where listIdx is a list of integers representing the indexes I want to get back. When this list contains more than 31 elements, I get a memory error. I started looking into the code of the pandas library and discovered that in the file pytables.py, the class BinOp has a variable called _max_selectors with the value 31. This variable is used later in this piece of code:
# if too many values to create the expression, use a filter instead
if self.op in ['==', '!='] and len(values) > self._max_selectors:

    filter_op = self.generate_filter_op()
    self.filter = (
        self.lhs,
        filter_op,
        pd.Index([v.value for v in values]))

    return self
The use of a filter causes the library to try to load the whole DataFrame into memory, and this is what originates the MemoryError. I also tried the chunksize parameter with a value of 10, but it did not work either. Do you know a better way to query an HDF file with such a large set of indexes?
Upvotes: 3
Views: 1920
Reputation: 210902
You can use the following technique:

pd.read_hdf(filename, 'key', where='index = vals')

where vals is a pd.Series or a Python list variable.
You can also use columns belonging to another DataFrame for filtering:
In [201]: another_df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=list('abc'))
In [202]: pd.read_hdf(fn, 'test', where='a = another_df["a"]').shape
Out[202]: (573, 3)
or another DataFrame's index:
In [203]: pd.read_hdf(fn, 'test', where='index = another_df.index').shape
Out[203]: (100, 3)
Demo:
Setup
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\hdf\test.h5'
store = pd.HDFStore(fn)
df = pd.DataFrame(np.random.randint(0, 10**4, size=(10**5, 3)), columns=list('abc'))
store.append('test', df, data_columns=True, complib='blosc', complevel=5)
store.close()
Test
vals = pd.Series(np.random.randint(0, 100, 500))
In [196]: pd.read_hdf(fn, 'test', where='index = vals').shape
Out[196]: (98, 3)
Same with a Python list:
idx_list = vals.tolist()
In [197]: pd.read_hdf(fn, 'test', where='index = idx_list').shape
Out[197]: (98, 3)
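If the index list is so large that a single query hits the in-memory filter path described in the question, one possible workaround (a sketch, not tested against a 64 GB file; the file path and index list below are made up for the demo) is to split the list into chunks of at most 31 values, so every call stays below BinOp._max_selectors and pandas builds a real PyTables selector expression instead of loading the whole table, then concatenate the pieces:

```python
import numpy as np
import pandas as pd

# Build a small demo table ('test.h5' is a hypothetical path).
fn = 'test.h5'
df = pd.DataFrame(np.random.randint(0, 10**4, size=(10**5, 3)),
                  columns=list('abc'))
with pd.HDFStore(fn) as store:
    store.append('test', df, data_columns=True)

# Hypothetical large list of row indexes to fetch.
all_idx = list(range(0, 1000, 7))

# Query in chunks of at most 31 values, so each call stays below
# _max_selectors and avoids the full-table filter fallback.
chunk_size = 31
parts = []
for i in range(0, len(all_idx), chunk_size):
    chunk = all_idx[i:i + chunk_size]
    parts.append(pd.read_hdf(fn, 'test', where='index = chunk'))

result = pd.concat(parts)
print(result.shape)
```

The trade-off is one disk query per chunk, but each query is cheap because PyTables can use the index, whereas a single oversized query forces the whole table through memory.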
Upvotes: 1