AJG519
AJG519

Reputation: 3379

Joining Pandas Dataframes on shared indexes

I have 3 DataFrames that have differing numbers of shared indexes. For example:

>>> df0=pd.DataFrame(index=pd.MultiIndex.from_product([[1,2,3,4],[2011,2012],['A','B']], names=['Season','Year','Location']))
>>> df0['Value0']=np.random.randint(1,100,len(df0))
>>> 
>>> df1=pd.DataFrame(index=pd.MultiIndex.from_product([[2011,2012],['A','B']], names=['Year','Location']))
>>> df1['Value1']=np.random.randint(1,100,len(df1))
>>> 
>>> df2=pd.DataFrame(index=['A','B'])
>>> df2.index.name='Location'
>>> df2['Value2']=np.random.randint(1,100,len(df2))
>>> df0
                      Value0
Season Year Location        
1      2011 A             18
            B             63
       2012 A             88
            B             30
2      2011 A             35
            B             60
       2012 A             61
            B              4
3      2011 A             70
            B              9
       2012 A             11
            B             38
4      2011 A             68
            B             57
       2012 A             13
            B             35
>>> df1
               Value1
Year Location        
2011 A             22
     B             74
2012 A             73
     B             44
>>> df2
          Value2
Location        
A             70
B             85
>>> 

I am looking for the best way to join them on their common indexes.

Things I have tried:

1) pd.concat([df0,df1,df2],1) would be nice because it accepts a list of dataframes, but this appears to only work if the dataframes have the same number of indexes.

2) Joining either of the multiindex dataframes with the single index DataFrame works: df1.join(df2) or df0.join(df2). However joining the DataFrame with two indexes with the DataFrame with three indexes does not: df0.join(df1) and gives me the following error: "NotImplementedError: merging with more than one level overlap on a multi-index is not implemented"

At this point, the approach I have been taking is to reset the indexes and use pd.merge(). See below:

def JoinMulti(DFList):
    FinalDF=DFList[0].reset_index()
    for OtherDF in DFList[1:]:
        FinalDF=pd.merge(FinalDF, OtherDF.reset_index(), how='outer')

    #Now I want reindex it so that it's indexed the same as the `DataFrame` with the highest number of levels
    NLevels=[x.index.nlevels for x in DFList]
    MaxIndexPos=NLevels.index(max(NLevels))
    FinalIndex=DFList[MaxIndexPos].index
    FinalDF=FinalDF.set_index(FinalIndex.names).reindex(FinalIndex)
    return FinalDF

>>> JoinMulti([df0,df1,df2])
                      Value0  Value1  Value2
Season Year Location                        
1      2011 A             43       5      96
            B             63      46      97
       2012 A             68       6      96
            B             23      99      97
2      2011 A             66       5      96
            B             30      46      97
       2012 A             45       6      96
            B             79      99      97
3      2011 A             66       5      96
            B             21      46      97
       2012 A             86       6      96
            B             11      99      97
4      2011 A             95       5      96
            B             58      46      97
       2012 A             32       6      96
            B             80      99      97
>>> 

Is this a reasonable approach? Are there any improvements that could be made or any exceptions that I'm forgetting about?

Upvotes: 6

Views: 3887

Answers (1)

jezrael
jezrael

Reputation: 863116

I modified solution by Stefan Jansen:

def jez(df0,df1,df2):
    df1 = df1.join(df2)
    df0 = df0.reset_index('Season')
    FinalDF = df0.join(df1).set_index('Season', append=True).reorder_levels(['Season', 'Year', 'Location']).sortlevel()
    return FinalDF

print jez(df0,df1,df2)

Timing:

In [41]: %timeit jez(df0,df1,df2)
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 5.02 ms per loop

In [42]: %timeit JoinMulti([df0,df1,df2])
100 loops, best of 3: 9.83 ms per loop

Upvotes: 1

Related Questions