Om Nom

Reputation: 179

pandas read_hdf with 'where' condition limitation?

I need to query an HDF5 file with a where clause containing three conditions; one of the conditions is a list with a length of 30:

myList = list(xrange(30))

h5DF   = pd.read_hdf(h5Filename, 'df', where='index=myList & date=dateString & time=timeString')

The query above raises ValueError: too many inputs, and the error is reproducible.
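For completeness, here's a minimal self-contained script that reproduces the error (the column names match my query, but the values are made up since the real file layout isn't shown here):

import numpy as np
import pandas as pd

# Build a small table-format HDF5 file; 'date' and 'time' are data columns
# so they can be used in a where clause. The values are placeholders.
df = pd.DataFrame({'date': ['20150101'] * 100,
                   'time': ['093000'] * 100,
                   'value': np.random.randn(100)})
df.to_hdf('test.h5', 'df', format='table', data_columns=['date', 'time'])

myList = list(range(30))   # xrange() in my Python 2 code above
dateString = '20150101'
timeString = '093000'

# Raises ValueError: too many inputs on pandas 0.15.2
h5DF = pd.read_hdf('test.h5', 'df',
                   where='index=myList & date=dateString & time=timeString')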

If I reduce the length of the list to 29 (keeping three conditions):

myList = list(xrange(29))

h5DF   = pd.read_hdf(h5Filename, 'df', where='index=myList & date=dateString & time=timeString')

or reduce the number of conditions to two (keeping the list length of 30), then it executes fine:

myList = list(xrange(30))

h5DF   = pd.read_hdf(h5Filename, 'df', where='index=myList & time=timeString')

Is this a known limitation? The pandas documentation at http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.pytables.read_hdf.html doesn't mention it, and after searching this forum it seems nobody has encountered it yet.

The pandas version is 0.15.2. Any help is appreciated.

Upvotes: 2

Views: 5736

Answers (1)

Jeff

Reputation: 129018

This is answered here.

This is a defect: numpy/numexpr cannot handle more than 31 operands in the expression tree. An expression like foo=[1,2,3,4] in the where of an HDFStore generates an expression like (foo==1) | (foo==2) | ..., so list conditions are expanded, and if the result has too many operands the query fails.
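A rough sketch of that expansion (this is an illustration, not the actual pandas rewrite code):

# Illustration only -- not the actual pandas internals.
myList = list(range(30))
expanded = ' | '.join('(index == %d)' % v for v in myList)
print(expanded)
# (index == 0) | (index == 1) | ... | (index == 29)
# Add the date and time sub-expressions, and the tree handed to numexpr
# exceeds its roughly 31-operand limit: ValueError: too many inputs.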

HDFStore handles this for a single operand (in other words, foo=list(range(31)) by itself is OK), but because you happen to have a nested sub-expression whose sub-nodes are themselves too long, it errors.

Generally, a better way to do this is to select a bigger range (e.g. the end-points of the selection for each operand), then do an in-memory .isin. It might even be faster, because HDF5 tends to be more efficient, IMHO, when selecting larger ranges rather than many individual values, even though you bring more data into memory.
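A minimal sketch of that approach, reusing the names from the question and assuming the index values are integers, so the end-points bound the selection:

import pandas as pd

# h5Filename, dateString, timeString as defined in the question.
myList = list(range(30))

# Query a contiguous range: two index operands instead of thirty
# expanded equality tests, plus the two other conditions.
where = ('index>=%d & index<=%d & date=dateString & time=timeString'
         % (min(myList), max(myList)))
h5DF = pd.read_hdf(h5Filename, 'df', where=where)

# Then cut down to the exact set of wanted index values in memory.
h5DF = h5DF[h5DF.index.isin(myList)]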

Upvotes: 5
