Reputation: 55
I need to work with large dimension data frame with multi index, so i tried to create a data frame to learn how to store it in an hdf5 file. The data frame is like this: (with the multi index in the first 2 columns)
Symbol Date 0
C 2014-07-21 4792
B 2014-07-21 4492
A 2014-07-21 5681
B 2014-07-21 8310
A 2014-07-21 1197
C 2014-07-21 4722
2014-07-21 7695
2014-07-21 1774
I'm using the pandas.to_hdf but it creates a "Fixed format store", when I try to select the datas in a group:
store.select('table','Symbol == "A"')
it returns some errors and the main problem is this
TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety
Then i tried to append the DataFrame like this:
store.append('ts1',timedata)
and that should create a table, but that gives me another error:
TypeError: [unicode] is not implemented as a table column
So i need the code to store the data frame in a table in hdf5 format and to select the datas from a single index (for that purpose i found this code: store.select('timedata','Symbol == "A"')
)
Upvotes: 5
Views: 6349
Reputation: 4686
Jeff has completely the right answer. I found a couple gotchas that I wanted to share and it won't fit in a comment - please consider this just a long form additional comment :)
(Pytables Versions) If you get missing attribute or method errors when trying to write the hdf file you may want to try updating your PyTables version. Pandas (as of this writing) leverages Pytables and I found at least one pairing of versions that threw some odd errors until I updated Pytables and reloaded.
(Data types) This may be fixed in Python 3 but in 2.7x the to_hdf has problems with unicode, with mixed data type columns, and with NaN values in floating point. Below is an example utility function to clean up a DataFrame in preparation for writing to_hdf that fixed all those problems for me. Note that this replaces NaN with zero, which was appropriate for my application but you may want to adjust that:
def clean_cols_for_hdf(data):
types = data.apply(lambda x: pd.lib.infer_dtype(x.values))
for col in types[types=='mixed'].index:
data[col] = .data[col].astype(str)
data[<your appropriate columns here>].fillna(0,inplace=True)
return data
Some of this just extends one of Jeff's comments as well. Jeff is awesome, please excuse the added answer but I wanted to chip in some details that fixed things for me.
Upvotes: 1
Reputation: 128918
Here's an example
In [8]: pd.__version__
Out[8]: '0.14.1'
In [9]: np.__version__
Out[9]: '1.8.1'
In [10]: import sys
In [11]: sys.version
Out[11]: '2.7.3 (default, Jan 7 2013, 09:17:50) \n[GCC 4.4.5]'
In [4]: df = DataFrame(np.arange(9).reshape(9,-1),index=pd.MultiIndex.from_product([list('abc'),date_range('20140721',periods=3)],names=['symbol','date']),columns=['value'])
In [5]: df
Out[5]:
value
symbol date
a 2014-07-21 0
2014-07-22 1
2014-07-23 2
b 2014-07-21 3
2014-07-22 4
2014-07-23 5
c 2014-07-21 6
2014-07-22 7
2014-07-23 8
In [6]: df.to_hdf('test.h5','df',mode='w',format='table')
In [7]: pd.read_hdf('test.h5','df',where='date=20140722')
Out[7]:
value
symbol date
a 2014-07-22 1
b 2014-07-22 4
c 2014-07-22 7
In [12]: pd.read_hdf('test.h5','df',where='symbol="a"')
Out[12]:
value
symbol date
a 2014-07-21 0
2014-07-22 1
2014-07-23 2
Upvotes: 6