Davide
Davide

Reputation: 55

Python (pandas): store a data frame in hdf5 with a multi index

I need to work with large dimension data frame with multi index, so i tried to create a data frame to learn how to store it in an hdf5 file. The data frame is like this: (with the multi index in the first 2 columns)

Symbol    Date          0

C         2014-07-21    4792
B         2014-07-21    4492
A         2014-07-21    5681
B         2014-07-21    8310
A         2014-07-21    1197
C         2014-07-21    4722
          2014-07-21    7695
          2014-07-21    1774

I'm using the pandas.to_hdf but it creates a "Fixed format store", when I try to select the datas in a group:

store.select('table','Symbol == "A"')

it returns some errors and the main problem is this

TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety

Then i tried to append the DataFrame like this:

store.append('ts1',timedata)

and that should create a table, but that gives me another error:

TypeError: [unicode] is not implemented as a table column

So i need the code to store the data frame in a table in hdf5 format and to select the datas from a single index (for that purpose i found this code: store.select('timedata','Symbol == "A"') )

Upvotes: 5

Views: 6349

Answers (2)

Ezekiel Kruglick
Ezekiel Kruglick

Reputation: 4686

Jeff has completely the right answer. I found a couple gotchas that I wanted to share and it won't fit in a comment - please consider this just a long form additional comment :)

(Pytables Versions) If you get missing attribute or method errors when trying to write the hdf file you may want to try updating your PyTables version. Pandas (as of this writing) leverages Pytables and I found at least one pairing of versions that threw some odd errors until I updated Pytables and reloaded.

(Data types) This may be fixed in Python 3 but in 2.7x the to_hdf has problems with unicode, with mixed data type columns, and with NaN values in floating point. Below is an example utility function to clean up a DataFrame in preparation for writing to_hdf that fixed all those problems for me. Note that this replaces NaN with zero, which was appropriate for my application but you may want to adjust that:

def clean_cols_for_hdf(data):
    types = data.apply(lambda x: pd.lib.infer_dtype(x.values))
    for col in types[types=='mixed'].index:
        data[col] = .data[col].astype(str)
    data[<your appropriate columns here>].fillna(0,inplace=True)
    return data

Some of this just extends one of Jeff's comments as well. Jeff is awesome, please excuse the added answer but I wanted to chip in some details that fixed things for me.

Upvotes: 1

Jeff
Jeff

Reputation: 128918

Here's an example

In [8]: pd.__version__
Out[8]: '0.14.1'

In [9]: np.__version__
Out[9]: '1.8.1'

In [10]: import sys

In [11]: sys.version
Out[11]: '2.7.3 (default, Jan  7 2013, 09:17:50) \n[GCC 4.4.5]'

In [4]: df = DataFrame(np.arange(9).reshape(9,-1),index=pd.MultiIndex.from_product([list('abc'),date_range('20140721',periods=3)],names=['symbol','date']),columns=['value'])

In [5]: df
Out[5]: 
                   value
symbol date             
a      2014-07-21      0
       2014-07-22      1
       2014-07-23      2
b      2014-07-21      3
       2014-07-22      4
       2014-07-23      5
c      2014-07-21      6
       2014-07-22      7
       2014-07-23      8

In [6]: df.to_hdf('test.h5','df',mode='w',format='table')

In [7]: pd.read_hdf('test.h5','df',where='date=20140722')
Out[7]: 
                   value
symbol date             
a      2014-07-22      1
b      2014-07-22      4
c      2014-07-22      7

In [12]: pd.read_hdf('test.h5','df',where='symbol="a"')
Out[12]: 
                   value
symbol date             
a      2014-07-21      0
       2014-07-22      1
       2014-07-23      2

Upvotes: 6

Related Questions