Kaushik Ghose
Kaushik Ghose

Reputation: 1062

Pandas, multiindex, dates, HDFstore and frame_tables

I want to use a MultiIndex with dates as one of the hierarchical index types. I also want to save the DataFrame as a frame_table, so that I can select subsets from disk without loading the whole thing. I currently get an error: TypeError: [date] is not implemented as a table column and I was wondering if I am using the multiindex incorrectly, or this is indeed a limitation of Pandas. Thanks!

import pandas as pd, numpy, datetime

print pd.__version__ #-> 0.13.0rc1

idx1 = pd.MultiIndex.from_tuples([(datetime.date(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)])
df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)

with pd.get_store('test1.h5') as f:
  f.put('trials',df1) #-> OK

with pd.get_store('test2.h5') as f:
  f.put('trials',df1,data_colums=True,format='t') #-> TypeError: [date] is not implemented as a table column

Upvotes: 2

Views: 1169

Answers (1)

Jeff
Jeff

Reputation: 128918

Use datetime.datetime as these types can be stored efficiently. Docs are here for an example of storing a multi-index frame in a HDFStore.

When storing a multi-index, you MUST specify names for the levels (HDFStore currently won't warn you if you try to store it ATM; this will be addressed in the next release).

In [20]: idx1 = pd.MultiIndex.from_tuples([(datetime.datetime(2013,12,d), s, t) for d in range(1,3) for s in range(2) for t in range(3)],names=['date','s','t'])

In [21]: df1 = pd.DataFrame(data=numpy.zeros((len(idx1),2)), columns=['a','b'], index=idx1)

You need to store as a table (put stores in Fixed format, unless append is specified).

In [22]: df1.to_hdf('test.h5','df',mode='w',format='table')

In [23]: pd.read_hdf('test.h5','df')
Out[23]: 
                a  b
date       s t      
2013-12-01 0 0  0  0
             1  0  0
             2  0  0
           1 0  0  0
             1  0  0
             2  0  0
2013-12-02 0 0  0  0
             1  0  0
             2  0  0
           1 0  0  0
             1  0  0
             2  0  0

[12 rows x 2 columns]

Sample selection

In [8]: pd.read_hdf('test.h5','df',where='date=20131202')
Out[8]: 
                a  b
date       s t      
2013-12-02 0 0  0  0
             1  0  0
             2  0  0
           1 0  0  0
             1  0  0
             2  0  0

[6 rows x 2 columns]

Upvotes: 2

Related Questions