pandas read_hdf with large set of conditions

I am trying to read a 64 GB HDF file (compressed with blosc) with the pandas library. The DataFrame contains 3 columns and 11,410,996,915 rows. I am trying to select specific rows by index with the method pandas.read_hdf and its where argument. The problem is that sometimes I need to get thousands of rows, so my where parameter looks something like this:

simMat = pd.read_hdf('global.h5','matrix', columns=['similarity'], where='index in {}'.format(listIdx))

where listIdx is a list of integers representing the indexes I want to get back. When this list contains more than 31 elements, I get a memory error. I started looking into the source code of the pandas library and discovered that in the file pytables.py, the class BinOp has a variable called _max_selectors with the value 31. This variable is later used in this piece of code:

# if too many values to create the expression, use a filter instead
if self.op in ['==', '!='] and len(values) > self._max_selectors:

    filter_op = self.generate_filter_op()
    self.filter = (
       self.lhs,
       filter_op,
       pd.Index([v.value for v in values]))
    return self

The use of a filter causes the library to try to load the whole DataFrame, and this triggers the MemoryError. I also tried the chunksize parameter with a value of 10, but it did not work either. Do you know a better way to query an HDF file with such a large set of indexes?

Upvotes: 3

Views: 1920

Answers (1)

MaxU - stand with Ukraine

Reputation: 210902

You can use the following technique:

pd.read_hdf(filename, 'key', where='index = vals')

where vals is a pd.Series or a Python list variable.

You can also filter using columns belonging to another DataFrame:

In [201]: another_df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=list('abc'))

In [202]: pd.read_hdf(fn, 'test', where='a = another_df["a"]').shape
Out[202]: (573, 3)

or another DataFrame's index:

In [203]: pd.read_hdf(fn, 'test', where='index = another_df.index').shape
Out[203]: (100, 3)

Demo:

Setup

fn = r'D:\temp\.data\hdf\test.h5'

store = pd.HDFStore(fn)

df = pd.DataFrame(np.random.randint(0, 10**4, size=(10**5, 3)), columns=list('abc'))

store.append('test', df, data_columns=True, complib='blosc', complevel=5)

store.close()

Test

vals = pd.Series(np.random.randint(0, 100, 500))

In [196]: pd.read_hdf(fn, 'test', where='index = vals').shape
Out[196]: (98, 3)

Same with a Python list:

idx_list = vals.tolist()

In [197]: pd.read_hdf(fn, 'test', where='index = idx_list').shape
Out[197]: (98, 3)
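If the index list is too large to pass comfortably in a single where clause, one possible fallback (my own sketch, not part of the answer above: the helper name read_by_index_batches and the batch size are assumptions) is to split the list into smaller batches, read each batch separately, and concatenate the partial results. The callable passed in stands for whatever per-batch reader you use, e.g. a pd.read_hdf call; the demo below substitutes an in-memory DataFrame so it runs without an HDF file:

```python
import numpy as np
import pandas as pd

def read_by_index_batches(read_chunk, idx_list, batch_size=1000):
    """Split a large index list into batches, fetch each batch with
    `read_chunk` (any callable returning the rows for a sub-list of
    indexes), and concatenate the partial results."""
    n_batches = max(1, len(idx_list) // batch_size)
    parts = [read_chunk(list(batch))
             for batch in np.array_split(np.asarray(idx_list), n_batches)]
    return pd.concat(parts)

# demo: an in-memory frame standing in for the HDF store
df = pd.DataFrame({'similarity': np.arange(10)})
result = read_by_index_batches(lambda idx: df.loc[idx], [1, 3, 5, 7],
                               batch_size=2)
# result holds the rows with indexes 1, 3, 5 and 7
```

With a real store, read_chunk would be something like `lambda idx: pd.read_hdf(fn, 'test', where='index = idx')`, keeping each batch small enough to avoid the filter path.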

Upvotes: 1
