Reputation: 1499
I have some code, reduced to the example included below, which takes some raw data, creates a pivot table from it, then merges it with another dataframe, and finally stores the results in an HDFStore object. If it's stored in fixed format, it can be retrieved just fine. However, if stored in Table format, it produces an error. I need table format so that I can extract chunks at a time (total dataset is tens of millions of rows).
Any ideas what the problem is?
Code sample:
import pandas as pd
def createFrame():
data = {
'colA':[1,1,1,2,2,2,3,3,3,4,4,4,4],
'colB':[5,5,5,5,5,5,5,5,5,5,5,5,5],
'colC':['a','b','c','a','b','c','a','b','c','d','e','f','g'],
'colD':['ap','ap','ap','bp','bp','bp','cp','cp','cp','dp','dp','dp','dp']
}
frame = pd.DataFrame(data)
return frame
def createOtherFrame():
data = {
'colD':['ap','bp','cp','dp'],
'colE':[100,200,300,400]
}
frame = pd.DataFrame(data)
return frame
if __name__ == '__main__':
pd.set_option('display.width', 120) # default is 80
pd.set_option('io.hdf.default_format','table')
pivotStore = pd.HDFStore('test.h5')
frame = createFrame()
otherFrame = createOtherFrame()
pivoted = frame.pivot_table(['colB'],
index=['colA'],
columns='colC',
aggfunc='sum'
)
print(pivoted)
print('')
# print(pivoted.info(),end='\n\n')
mergedFrameA = pd.merge(frame[['colA','colD']].drop_duplicates(),
otherFrame,
how = 'left',
on='colD'
).set_index('colA')
# print(mergedFrameA.info())
print(mergedFrameA)
mergedFrameB = pd.merge(pivoted,mergedFrameA,how='left',left_index=True,right_index=True)
# print(mergedFrameB.info())
print(mergedFrameB)
pivotStore['bob'] = mergedFrameB
pivotStore.close()
store = pd.HDFStore('test.h5')
extracted = store.select('bob',start=0,stop=5)
print(extracted)
store.close()
Output produced (with error):
colB
colC a b c d e f g
colA
1 5 5 5 NaN NaN NaN NaN
2 5 5 5 NaN NaN NaN NaN
3 5 5 5 NaN NaN NaN NaN
4 NaN NaN NaN 5 5 5 5
colD colE
colA
1 ap 100
2 bp 200
3 cp 300
4 dp 400
(colB, a) (colB, b) (colB, c) (colB, d) (colB, e) (colB, f) (colB, g) colD colE
colA
1 5 5 5 NaN NaN NaN NaN ap 100
2 5 5 5 NaN NaN NaN NaN bp 200
3 5 5 5 NaN NaN NaN NaN cp 300
4 NaN NaN NaN 5 5 5 5 dp 400
Traceback (most recent call last):
File "C:\multiindextest.py", line 52, in <module>
extracted = store.select('bob',start=0,stop=5)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 3999, in read
cols.set_names(names, inplace=True)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 529, in set_names
idx._set_names(names, level=level)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 3274, in _set_names
'Length of names must match number of levels in MultiIndex.')
ValueError: Length of names must match number of levels in MultiIndex.
Closing remaining open files:test.h5...done
Upvotes: 1
Views: 9764
Reputation: 128918
You can't store an index like that, that is partially tuples and partially strings. Its not a real MultiIndex (nor useful in any way).
I would simply use a regular index. You could also do this with a MultiIndex, but every column has to be part of it.
In [67]: pivoted = frame.pivot_table('colB',index='colA',columns='colC',aggfunc='sum')
In [68]: pivoted
Out[68]:
colC a b c d e f g
colA
1 5 5 5 NaN NaN NaN NaN
2 5 5 5 NaN NaN NaN NaN
3 5 5 5 NaN NaN NaN NaN
4 NaN NaN NaN 5 5 5 5
In [69]: df = pd.concat([pivoted,mergedFrameA],axis=1)
In [70]: df
Out[70]:
colC a b c d e f g colD colE
colA
1 5 5 5 NaN NaN NaN NaN ap 100
2 5 5 5 NaN NaN NaN NaN bp 200
3 5 5 5 NaN NaN NaN NaN cp 300
4 NaN NaN NaN 5 5 5 5 dp 400
Upvotes: 1