PTTHomps
PTTHomps

Reputation: 1499

Trouble retrieving multi-index Pandas DataFrame from HDFStore (in Table format)

I have some code, reduced to the example included below, which takes some raw data, creates a pivot table from it, then merges it with another dataframe, and finally stores the results in an HDFStore object. If it's stored in fixed format, it can be retrieved just fine. However, if stored in Table format, it produces an error. I need table format so that I can extract chunks at a time (total dataset is tens of millions of rows).

Any ideas what the problem is?

Code sample:

import pandas as pd

def createFrame():
    data = {
             'colA':[1,1,1,2,2,2,3,3,3,4,4,4,4],
             'colB':[5,5,5,5,5,5,5,5,5,5,5,5,5],
             'colC':['a','b','c','a','b','c','a','b','c','d','e','f','g'],
             'colD':['ap','ap','ap','bp','bp','bp','cp','cp','cp','dp','dp','dp','dp']
           }
    frame = pd.DataFrame(data)
    return frame

def createOtherFrame():
    data = {
             'colD':['ap','bp','cp','dp'],
             'colE':[100,200,300,400]
           }
    frame = pd.DataFrame(data)
    return frame 

if __name__ == '__main__':
    pd.set_option('display.width', 120) # default is 80
    pd.set_option('io.hdf.default_format','table')    

    pivotStore = pd.HDFStore('test.h5')
    frame = createFrame()
    otherFrame = createOtherFrame()
    pivoted = frame.pivot_table(['colB'],
                                index=['colA'],
                                columns='colC',
                                aggfunc='sum'
                                )
    print(pivoted)
    print('')    
#    print(pivoted.info(),end='\n\n')

    mergedFrameA = pd.merge(frame[['colA','colD']].drop_duplicates(),
                        otherFrame, 
                        how = 'left',
                        on='colD'
                       ).set_index('colA')
#    print(mergedFrameA.info())
    print(mergedFrameA)

    mergedFrameB = pd.merge(pivoted,mergedFrameA,how='left',left_index=True,right_index=True)
#    print(mergedFrameB.info())
    print(mergedFrameB)

    pivotStore['bob'] = mergedFrameB
    pivotStore.close()
    store = pd.HDFStore('test.h5')
    extracted = store.select('bob',start=0,stop=5)
    print(extracted)
    store.close()

Output produced (with error):

     colB                        
colC    a   b   c   d   e   f   g
colA                             
1       5   5   5 NaN NaN NaN NaN
2       5   5   5 NaN NaN NaN NaN
3       5   5   5 NaN NaN NaN NaN
4     NaN NaN NaN   5   5   5   5

     colD  colE
colA           
1      ap   100
2      bp   200
3      cp   300
4      dp   400
      (colB, a)  (colB, b)  (colB, c)  (colB, d)  (colB, e)  (colB, f)  (colB, g) colD  colE
colA                                                                                        
1             5          5          5        NaN        NaN        NaN        NaN   ap   100
2             5          5          5        NaN        NaN        NaN        NaN   bp   200
3             5          5          5        NaN        NaN        NaN        NaN   cp   300
4           NaN        NaN        NaN          5          5          5          5   dp   400
Traceback (most recent call last):
  File "C:\multiindextest.py", line 52, in <module>
    extracted = store.select('bob',start=0,stop=5)
  File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 665, in select
    return it.get_result()
  File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 658, in func
    columns=columns, **kwargs)
  File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 3999, in read
    cols.set_names(names, inplace=True)
  File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 529, in set_names
    idx._set_names(names, level=level)
  File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 3274, in _set_names
    'Length of names must match number of levels in MultiIndex.')
ValueError: Length of names must match number of levels in MultiIndex.
Closing remaining open files:test.h5...done

Upvotes: 1

Views: 9764

Answers (1)

Jeff
Jeff

Reputation: 128918

You can't store an index like that, that is partially tuples and partially strings. Its not a real MultiIndex (nor useful in any way).

I would simply use a regular index. You could also do this with a MultiIndex, but every column has to be part of it.

In [67]: pivoted = frame.pivot_table('colB',index='colA',columns='colC',aggfunc='sum')

In [68]: pivoted
Out[68]: 
colC   a   b   c   d   e   f   g
colA                            
1      5   5   5 NaN NaN NaN NaN
2      5   5   5 NaN NaN NaN NaN
3      5   5   5 NaN NaN NaN NaN
4    NaN NaN NaN   5   5   5   5

In [69]: df = pd.concat([pivoted,mergedFrameA],axis=1)

In [70]: df
Out[70]: 
colC   a   b   c   d   e   f   g colD  colE
colA                                       
1      5   5   5 NaN NaN NaN NaN   ap   100
2      5   5   5 NaN NaN NaN NaN   bp   200
3      5   5   5 NaN NaN NaN NaN   cp   300
4    NaN NaN NaN   5   5   5   5   dp   400

Upvotes: 1

Related Questions