Saving Pandas DataFrames as a HDF5 store, various errors

Question

Just want to archive some Pandas DataFrames in a HDF5 store (.h5 file). Below is the code I'm using.

# Fake data over N runs
Data_N = []
for n in range(5):
    Data_N.append(np.random.randn(5000,15,125))

# Create HDFStore object
store = pd.HDFStore('test.h5')

# For each run:
for n in range(len(Data_N)):
    Data = Data_N[n]

    # Pandas DataFrame for "flattened" fake data
    Data_subDFs = []
    nanbuff = np.nan*np.zeros((1,len(Data[0,0])))

    for i in range(len(Data)):
        Data_i = np.vstack((nanbuff,Data[i,:,:]))
        Data_subDFs.append(pd.DataFrame(data = Data_i))

    Data_DF = pd.concat(Data_subDFs)

    # Row and column labels for the DataFrame
    Data_rows = []
    for i in range(len(Data)):
        Data_rows.append(['Layer %d:' % (i+1)] + range(1,len(Data[0])+1))

    Data_DF.index = sum(Data_rows,[])
    Data_DF.columns = range(1,len(Data[0,0])+1)

    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (n+1), Data_DF)
    #store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)

# Save the HDF5 file
store.close()

This gives the following output:

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]

If I use the second version of put, it gives:

TypeError: Passing an incorrect value to a table column. Expected a Col (or subc
lass) instance and got: "ObjectAtom()". Please make use of the Col(), or descend
ant, constructor to properly initialize columns.

And if I use the third version of put, it gives:

ValueError: cannot have non-object label DataIndexableCol

Can someone please explain the different versions, and why I can't save what I think is a valid Pandas DataFrame in HDF5 without the pickling?

If it helps, I don't think I need to be able to append the DataFrame/store. I just want the best performing way to save the DF using Pandas HDF5 interface.

Thanks!

EDIT 1:

I updated the code after "For each run:" to this

# For each run:
for run in range(len(Data_N)):
    Data = Data_N[run]
    l = len(Data)
    m = len(Data[0])
    n = len(Data[0,0])

    # Pandas DataFrame for "flattened" fake data
    Data_subDFs = []

    for i in range(len(Data)):
        Data_i = Data[i,:,:]
        Data_subDFs.append(pd.DataFrame(data = Data_i))

    Data_DF = pd.concat(Data_subDFs)

    # Row and column labels for the DataFrame
    L1 = np.zeros((l*m,1), dtype=object) # Layer number
    L2 = np.zeros((l*m,1), dtype=object) # Row number

    for i in range(l):
        for j in range(m):
            L1[i*m + j,0] = 'Layer %d' % (i+1)
            L2[i*m + j,0] = '%d' % (j+1)

    Data_DF.index = np.hstack((L1,L2))
    Data_DF.columns = range(1,n+1)

    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (run+1), Data_DF)
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)

But that gives the same warning or errors, for each put line.

EDIT 2 (this worked!):

# For each run:
for run in range(len(Data_N)):
    Data = Data_N[run]
    l = len(Data)
    m = len(Data[0])
    n = len(Data[0,0])

    # Pandas DataFrame for "flattened" fake data
    Data_DF = pd.DataFrame(Data.reshape(l*m,n))

    # Layer and row labels
    layers = np.arange(1,l+1)
    rows = np.arange(1,m+1)

    # Pandas multi-index
    mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])

    # DataFrame multi-index and column labels
    Data_DF.index = mindex
    Data_DF.columns = range(1,n+1)

    # Put Pandas DataFrame into store
    store.put('Data_DF_%d' % (run+1), Data_DF)
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
    #store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)

The third put line still gives the same error, but since the second line works I'll assume that the third line is just an invalid command in this case.

Second line was quite a bit faster than first line, as well, and both considerably faster than the pickling route. Thanks!

MaxU - stand with Ukraine · Accepted Answer

UPDATE:

Here is a small demo:

Setup:

data = np.random.randn(5,10,5)
index = pd.MultiIndex.from_product([np.arange(1, len(data)+1),
                                  np.arange(1,len(data[0])+1)], names=['Layer','No'])
df = pd.DataFrame(data.reshape(data.shape[0] * data.shape[1], data.shape[2]),
                  index=index)

Data:

In [82]: df
Out[82]:
                 0         1         2         3         4
Layer No
1     1   1.167144  0.640303  0.059197 -1.637180  0.667196
      2   2.150872 -0.825325 -0.332458 -1.307043  1.361330
      3  -0.931299 -0.931882  0.153943 -0.446289  0.651594
      4  -0.131500 -0.489745  1.264029  0.889779  1.081613
      5  -0.479022 -1.516204  0.616170  0.126860  0.125559
      6   1.114287 -0.939504  0.058869  0.321159  0.340881
      7  -0.527516 -0.362337 -0.590430 -0.609017  1.835716
      8   0.063372  0.000792  0.855485 -0.113592  0.890687
      9  -0.160041  1.978954  0.778428  1.988354  2.095665
      10  0.687911  0.115918 -0.653885  0.486365 -0.775659
2     1  -0.123350  0.674359 -0.120634 -1.350044 -0.176252
      2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      3   0.878597  0.241594  0.405382  1.785109  1.228188
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      5  -1.424249 -0.183216 -0.044330  0.324894 -0.271179
      6  -0.345720 -0.942421  0.538227 -0.558793 -1.075346
      7   1.327952 -2.335520 -0.164645  1.489798 -0.876896
      8   1.043723  0.770489 -1.052739 -0.830190  1.005406
      9   0.789100 -0.706633 -1.014431 -1.164513 -0.266424
      10  2.061175  0.933526 -1.601836 -1.542535 -1.220943
3     1  -0.061520 -0.932599  0.103480 -0.318529 -0.311965
      2  -0.401409 -0.308739 -1.399233 -1.172032 -0.550774
      3   0.670272  1.215724  0.711328  2.332297 -1.326704
      4   0.377469  0.752313 -1.223832  0.431555 -0.901796
      5  -2.386383  0.053921 -1.175427 -0.794099 -0.469374
      6   0.951571 -2.220609  0.208136 -2.141828  0.010316
      7   1.047133  0.924568  0.282091  1.367981 -0.617389
      8   1.083008 -1.519416  0.535690  0.196885 -0.022692
      9   1.307252  1.099716  0.766976 -0.466699  1.113605
      10 -0.614214  0.702395 -0.131248  1.773092  0.241553
4     1  -1.280026  0.278248 -0.518560 -0.395394  0.434473
      2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      3  -0.539773 -0.755483 -1.002526  0.198792 -0.120656
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      5  -0.777678  1.734322 -1.270129  0.160094  0.355290
      6  -1.037775 -0.542944 -0.913428  0.885965 -0.155220
      7  -0.855498 -0.330268 -1.903738  0.098101 -0.670830
      8   0.786258  0.599100 -0.426781  0.425572  0.132932
      9  -0.430497 -1.414292 -0.997637  0.696176 -0.480886
      10  1.211665 -1.233842  0.137176  1.520013 -1.052884
5     1  -0.267698 -1.013917 -1.324896 -1.189835 -0.192396
      2   1.047264 -0.454829  1.051039  1.565423  0.749844
      3   0.159177  0.481088  0.711499 -1.217079  0.444402
      4   0.254420 -0.114102  0.620231  1.890822  1.269808
      5   0.673696 -0.321638 -0.887355  0.426549 -0.935591
      6  -1.836808  0.450332  1.187512 -0.215318 -1.142346
      7  -1.496568  0.633886  0.625143  0.295385  1.445084
      8  -0.473427 -0.608318 -0.602080  0.134105  0.704027
      9   2.319899  0.763272  0.861798  1.464612 -0.708869
      10 -0.199555  0.721122  0.099777 -0.466488  0.923112

In [84]: df.index.levels
Out[84]: FrozenList([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])

Now you can slice it like as follows:

In [85]: idx = pd.IndexSlice

In [86]: df.loc[idx[[2,4], 2:5], :]
Out[86]:
                 0         1         2         3         4
Layer No
2     2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      3   0.878597  0.241594  0.405382  1.785109  1.228188
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      5  -1.424249 -0.183216 -0.044330  0.324894 -0.271179
4     2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      3  -0.539773 -0.755483 -1.002526  0.198792 -0.120656
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      5  -0.777678  1.734322 -1.270129  0.160094  0.355290

Saving to and selecting from HDF store:

In [88]: store = pd.HDFStore('d:/temp/test.h5')

In [89]: store.append('test', df, complib='blosc', complevel=5)

In [90]: store.close()

In [91]: store = pd.HDFStore('d:/temp/test.h5')

In [92]: store.select('test', where="Layer in [2,4] and No in [2,4,6]")
Out[92]:
                 0         1         2         3         4
Layer No
2     2  -1.986077 -0.846584  0.895982  0.236790  0.240023
      4  -1.510238 -0.303274  0.247082  1.841996 -0.864595
      6  -0.345720 -0.942421  0.538227 -0.558793 -1.075346
4     2   1.498882 -1.359542  0.012312 -0.231728 -2.643232
      4   0.056788  1.289477 -0.440122 -1.454418 -0.043193
      6  -1.037775 -0.542944 -0.913428  0.885965 -0.155220

MultiIndex documentation (with two levels: Layer, No) instead.

Saving Pandas DataFrames as a HDF5 store, various errors

Answers (1)

Related Questions