Reputation: 51
I am trying to read a 64 GB HDF file (compressed with blosc) with the pandas library. The DataFrame contains 3 columns and 11410996915 rows. I am trying to select specific rows by index with the method pandas.read_hdf and the where argument. The problem is that sometimes I need to get thousands of rows back, so my where parameter looks something like this:
simMat = pd.read_hdf('global.h5','matrix', columns=['similarity'], where='index in {}'.format(listIdx))
where listIdx is a list of integers representing the indexes I want to get back. When this list contains more than 31 elements, I get a memory error. I started looking into the code of the pandas library and discovered that in the file pytables.py, the class BinOp has a variable called _max_selectors with the value 31. This variable is used later in this piece of code:
# if too many values to create the expression, use a filter instead
if self.op in ['==', '!='] and len(values) > self._max_selectors:

    filter_op = self.generate_filter_op()
    self.filter = (
        self.lhs,
        filter_op,
        pd.Index([v.value for v in values]))

    return self
The use of a filter causes the library to try to load the whole DataFrame into memory, and this is what originates the MemoryError. I also tried the chunksize parameter with a value of 10, but it did not work either. Do you know a better way to query an HDF file with such a large set of indexes?
Upvotes: 3
Views: 1920
Reputation: 210902
You can use the following technique:

pd.read_hdf(filename, 'key', where='index = vals')

where vals is a pd.Series or a Python list variable.
You can also use columns belonging to another DataFrame for filtering:
In [201]: another_df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=list('abc'))
In [202]: pd.read_hdf(fn, 'test', where='a = another_df["a"]').shape
Out[202]: (573, 3)
or another DataFrame's index:
In [203]: pd.read_hdf(fn, 'test', where='index = another_df.index').shape
Out[203]: (100, 3)
Demo:
Setup
import numpy as np
import pandas as pd

fn = r'D:\temp\.data\hdf\test.h5'
store = pd.HDFStore(fn)
df = pd.DataFrame(np.random.randint(0, 10**4, size=(10**5, 3)), columns=list('abc'))
store.append('test', df, data_columns=True, complib='blosc', complevel=5)
store.close()
Test
vals = pd.Series(np.random.randint(0, 100, 500))
In [196]: pd.read_hdf(fn, 'test', where='index = vals').shape
Out[196]: (98, 3)
Same with a Python list:
idx_list = vals.tolist()
In [197]: pd.read_hdf(fn, 'test', where='index = idx_list').shape
Out[197]: (98, 3)
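If the index list is so large that a single query hits the in-memory filter path described in the question, one possible workaround (a sketch, not tested against a 64 GB file; the file path and index list below are made up for the demo) is to split the list into chunks of at most 31 values, so every call stays below BinOp._max_selectors and pandas builds a real PyTables selector expression instead of loading the whole table, then concatenate the pieces:

```python
import numpy as np
import pandas as pd

# Build a small demo table ('test.h5' is a hypothetical path).
fn = 'test.h5'
df = pd.DataFrame(np.random.randint(0, 10**4, size=(10**5, 3)),
                  columns=list('abc'))
with pd.HDFStore(fn) as store:
    store.append('test', df, data_columns=True)

# Hypothetical large list of row indexes to fetch.
all_idx = list(range(0, 1000, 7))

# Query in chunks of at most 31 values, so each call stays below
# _max_selectors and avoids the full-table filter fallback.
chunk_size = 31
parts = []
for i in range(0, len(all_idx), chunk_size):
    chunk = all_idx[i:i + chunk_size]
    parts.append(pd.read_hdf(fn, 'test', where='index = chunk'))

result = pd.concat(parts)
print(result.shape)
```

The trade-off is one disk query per chunk, but each query is cheap because PyTables can use the index, whereas a single oversized query forces the whole table through memory.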
Upvotes: 1