Reputation: 725
I have two dataframes that I would like to concatenate column-wise (axis=1) with an inner join. One of the dataframes has some duplicate indices, but the rows themselves are not duplicates, and I don't want to lose their data:
df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns=['a','b']).set_index('a')
df2 = pd.DataFrame([{'a':1,'c':5},{'a':2,'c':6}],columns = ['a','c']).set_index('a')
>>> df1
b
a
1 2
1 3
2 4
>>> df2
c
a
1 5
2 6
The default concat behavior is to fill missing values with NaNs:
>>> pd.concat([df1,df2])
b c
a
1 2 NaN
1 3 NaN
2 4 NaN
1 NaN 5
2 NaN 6
I want to keep the duplicate indices from df1 and fill their rows with the matching values from df2, but in pandas 0.13.1 an inner join on axis=1 raises an error. In more recent versions of pandas, concat does what I want:
>>> pd.concat([df1, df2], axis=1, join='inner')
b c
a
1 2 5
1 3 5
2 4 6
What's the best way to achieve the result I want? Is there a groupby solution? Or maybe I shouldn't be using concat at all?
Upvotes: 13
Views: 30102
Reputation: 285
Coming back to this because I was looking for how to merge on columns with different names while keeping duplicates:
df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns=['a','b'])
df1
a b
0 1 2
1 1 3
2 2 4
df2 = pd.DataFrame([{'c':1,'d':5},{'c':2,'d':6}],
                   columns=['c','d'])
df2
c d
0 1 5
1 2 6
I found that pd.merge(df1, df2.set_index('c'), left_on='a', right_index=True) accomplishes this:
df3
a b d
0 1 2 5
1 1 3 5
2 2 4 6
You could also use df2.set_index('a') with left_on='a' if the column names are the same (as in the OP's example).
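A minimal runnable sketch of that same-name variant, using the OP's original frames (where both key columns are named a); the variable name df3 is just for illustration:

```python
import pandas as pd

# OP's frames: df1 has duplicate values in its key column 'a'
df1 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'a': 2, 'b': 4}],
                   columns=['a', 'b'])
df2 = pd.DataFrame([{'a': 1, 'c': 5}, {'a': 2, 'c': 6}],
                   columns=['a', 'c'])

# Merge df1's 'a' column against df2's index; each duplicate key in df1
# picks up the single matching value from df2
df3 = pd.merge(df1, df2.set_index('a'), left_on='a', right_index=True)
print(df3)
```

The duplicate rows of df1 are preserved, each paired with the df2 value for its key.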
Upvotes: 0
Reputation: 393963
You can perform a merge and set the parameters so that the index is used on both the left-hand and right-hand side:
In [4]:
df1.merge(df2, left_index=True, right_index=True)
Out[4]:
b c
a
1 2 5
1 3 5
2 4 6
[3 rows x 2 columns]
concat should also work; it works for me:
In [5]:
pd.concat([df1,df2], join='inner', axis=1)
Out[5]:
b c
a
1 2 5
1 3 5
2 4 6
[3 rows x 2 columns]
Upvotes: 8