Reputation: 31
Just want to archive some Pandas DataFrames in a HDF5 store (.h5 file). Below is the code I'm using.
# Fake data over N runs
Data_N = []
for n in range(5):
Data_N.append(np.random.randn(5000,15,125))
# Create HDFStore object
store = pd.HDFStore('test.h5')
# For each run:
for n in range(len(Data_N)):
Data = Data_N[n]
# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
nanbuff = np.nan*np.zeros((1,len(Data[0,0])))
for i in range(len(Data)):
Data_i = np.vstack((nanbuff,Data[i,:,:]))
Data_subDFs.append(pd.DataFrame(data = Data_i))
Data_DF = pd.concat(Data_subDFs)
# Row and column labels for the DataFrame
Data_rows = []
for i in range(len(Data)):
Data_rows.append(['Layer %d:' % (i+1)] + range(1,len(Data[0])+1))
Data_DF.index = sum(Data_rows,[])
Data_DF.columns = range(1,len(Data[0,0])+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (n+1), Data_DF)
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)
# Save the HDF5 file
store.close()
This gives the following output:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]
If I use the second version of put, it gives:
TypeError: Passing an incorrect value to a table column. Expected a Col (or subc
lass) instance and got: "ObjectAtom()". Please make use of the Col(), or descend
ant, constructor to properly initialize columns.
And if I use the third version of put, it gives:
ValueError: cannot have non-object label DataIndexableCol
Can someone please explain the different versions, and why I can't save what I think is a valid Pandas DataFrame in HDF5 without the pickling?
If it helps, I don't think I need to be able to append the DataFrame/store. I just want the best performing way to save the DF using Pandas HDF5 interface.
Thanks!
EDIT 1:
I updated the code after "For each run:" to this
# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])
# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
for i in range(len(Data)):
Data_i = Data[i,:,:]
Data_subDFs.append(pd.DataFrame(data = Data_i))
Data_DF = pd.concat(Data_subDFs)
# Row and column labels for the DataFrame
L1 = np.zeros((l*m,1), dtype=object) # Layer number
L2 = np.zeros((l*m,1), dtype=object) # Row number
for i in range(l):
for j in range(m):
L1[i*m + j,0] = 'Layer %d' % (i+1)
L2[i*m + j,0] = '%d' % (j+1)
Data_DF.index = np.hstack((L1,L2))
Data_DF.columns = range(1,n+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
But that gives the same warning or errors, for each put line.
EDIT 2 (this worked!):
# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])
# Pandas DataFrame for "flattened" fake data
Data_DF = pd.DataFrame(Data.reshape(l*m,n))
# Layer and row labels
layers = np.arange(1,l+1)
rows = np.arange(1,m+1)
# Pandas multi-index
mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])
# DataFrame multi-index and column labels
Data_DF.index = mindex
Data_DF.columns = range(1,n+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
The third put line still gives the same error, but since the second line works I'll assume that the third line is just an invalid command in this case.
Second line was quite a bit faster than first line, as well, and both considerably faster than the pickling route. Thanks!
Upvotes: 2
Views: 2642
Reputation: 210962
UPDATE:
Here is a small demo:
Setup:
data = np.random.randn(5,10,5)
index = pd.MultiIndex.from_product([np.arange(1, len(data)+1),
np.arange(1,len(data[0])+1)], names=['Layer','No'])
df = pd.DataFrame(data.reshape(data.shape[0] * data.shape[1], data.shape[2]),
index=index)
Data:
In [82]: df
Out[82]:
0 1 2 3 4
Layer No
1 1 1.167144 0.640303 0.059197 -1.637180 0.667196
2 2.150872 -0.825325 -0.332458 -1.307043 1.361330
3 -0.931299 -0.931882 0.153943 -0.446289 0.651594
4 -0.131500 -0.489745 1.264029 0.889779 1.081613
5 -0.479022 -1.516204 0.616170 0.126860 0.125559
6 1.114287 -0.939504 0.058869 0.321159 0.340881
7 -0.527516 -0.362337 -0.590430 -0.609017 1.835716
8 0.063372 0.000792 0.855485 -0.113592 0.890687
9 -0.160041 1.978954 0.778428 1.988354 2.095665
10 0.687911 0.115918 -0.653885 0.486365 -0.775659
2 1 -0.123350 0.674359 -0.120634 -1.350044 -0.176252
2 -1.986077 -0.846584 0.895982 0.236790 0.240023
3 0.878597 0.241594 0.405382 1.785109 1.228188
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
5 -1.424249 -0.183216 -0.044330 0.324894 -0.271179
6 -0.345720 -0.942421 0.538227 -0.558793 -1.075346
7 1.327952 -2.335520 -0.164645 1.489798 -0.876896
8 1.043723 0.770489 -1.052739 -0.830190 1.005406
9 0.789100 -0.706633 -1.014431 -1.164513 -0.266424
10 2.061175 0.933526 -1.601836 -1.542535 -1.220943
3 1 -0.061520 -0.932599 0.103480 -0.318529 -0.311965
2 -0.401409 -0.308739 -1.399233 -1.172032 -0.550774
3 0.670272 1.215724 0.711328 2.332297 -1.326704
4 0.377469 0.752313 -1.223832 0.431555 -0.901796
5 -2.386383 0.053921 -1.175427 -0.794099 -0.469374
6 0.951571 -2.220609 0.208136 -2.141828 0.010316
7 1.047133 0.924568 0.282091 1.367981 -0.617389
8 1.083008 -1.519416 0.535690 0.196885 -0.022692
9 1.307252 1.099716 0.766976 -0.466699 1.113605
10 -0.614214 0.702395 -0.131248 1.773092 0.241553
4 1 -1.280026 0.278248 -0.518560 -0.395394 0.434473
2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
3 -0.539773 -0.755483 -1.002526 0.198792 -0.120656
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
5 -0.777678 1.734322 -1.270129 0.160094 0.355290
6 -1.037775 -0.542944 -0.913428 0.885965 -0.155220
7 -0.855498 -0.330268 -1.903738 0.098101 -0.670830
8 0.786258 0.599100 -0.426781 0.425572 0.132932
9 -0.430497 -1.414292 -0.997637 0.696176 -0.480886
10 1.211665 -1.233842 0.137176 1.520013 -1.052884
5 1 -0.267698 -1.013917 -1.324896 -1.189835 -0.192396
2 1.047264 -0.454829 1.051039 1.565423 0.749844
3 0.159177 0.481088 0.711499 -1.217079 0.444402
4 0.254420 -0.114102 0.620231 1.890822 1.269808
5 0.673696 -0.321638 -0.887355 0.426549 -0.935591
6 -1.836808 0.450332 1.187512 -0.215318 -1.142346
7 -1.496568 0.633886 0.625143 0.295385 1.445084
8 -0.473427 -0.608318 -0.602080 0.134105 0.704027
9 2.319899 0.763272 0.861798 1.464612 -0.708869
10 -0.199555 0.721122 0.099777 -0.466488 0.923112
In [84]: df.index.levels
Out[84]: FrozenList([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
Now you can slice it like as follows:
In [85]: idx = pd.IndexSlice
In [86]: df.loc[idx[[2,4], 2:5], :]
Out[86]:
0 1 2 3 4
Layer No
2 2 -1.986077 -0.846584 0.895982 0.236790 0.240023
3 0.878597 0.241594 0.405382 1.785109 1.228188
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
5 -1.424249 -0.183216 -0.044330 0.324894 -0.271179
4 2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
3 -0.539773 -0.755483 -1.002526 0.198792 -0.120656
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
5 -0.777678 1.734322 -1.270129 0.160094 0.355290
Saving to and selecting from HDF store:
In [88]: store = pd.HDFStore('d:/temp/test.h5')
In [89]: store.append('test', df, complib='blosc', complevel=5)
In [90]: store.close()
In [91]: store = pd.HDFStore('d:/temp/test.h5')
In [92]: store.select('test', where="Layer in [2,4] and No in [2,4,6]")
Out[92]:
0 1 2 3 4
Layer No
2 2 -1.986077 -0.846584 0.895982 0.236790 0.240023
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
6 -0.345720 -0.942421 0.538227 -0.558793 -1.075346
4 2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
6 -1.037775 -0.542944 -0.913428 0.885965 -0.155220
MultiIndex documentation (with two levels: Layer
, No
) instead.
Upvotes: 1