Reputation: 725
I have two dataframes that I would like to concatenate column-wise (axis=1) with an inner join. One of the dataframes has some duplicate indices, but the rows themselves are not duplicates, and I don't want to lose their data:
df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns=['a','b']).set_index('a')
df2 = pd.DataFrame([{'a':1,'c':5},{'a':2,'c':6}],columns = ['a','c']).set_index('a')
>>> df1
b
a
1 2
1 3
2 4
>>> df2
c
a
1 5
2 6
The default concat behavior is to fill missing values with NaNs:
>>> pd.concat([df1,df2])
b c
a
1 2 NaN
1 3 NaN
2 4 NaN
1 NaN 5
2 NaN 6
I want to keep the duplicate indices from df1 and fill their rows with the matching values from df2, but in pandas 0.13.1 an inner join on axis=1 raises an error. In more recent versions of pandas, concat does what I want:
>>> pd.concat([df1, df2], axis=1, join='inner')
b c
a
1 2 5
1 3 5
2 4 6
What's the best way to achieve the result I want? Is there a groupby solution? Or maybe I shouldn't be using concat at all?
Upvotes: 13
Views: 30102
Reputation: 285
Coming back to this because I was looking for how to merge on columns with different names while keeping duplicates:
df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns=['a','b'])
df1
a b
0 1 2
1 1 3
2 2 4
df2 = pd.DataFrame([{'c':1,'d':5},{'c':2,'d':6}],
                   columns=['c','d'])
df2
c d
0 1 5
1 2 6
I found that pd.merge(df1, df2.set_index('c'), left_on='a', right_index=True) accomplishes this:
df3
a b d
0 1 2 5
1 1 3 5
2 2 4 6
You could also use df2.set_index('a') with left_on='a' if the column names are the same (as in the OP's example).
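A minimal runnable sketch of that same-name variant, using the OP's original frames (where both key columns are named a); the variable name df3 is just for illustration:

```python
import pandas as pd

# OP's frames: df1 has duplicate values in its key column 'a'
df1 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'a': 2, 'b': 4}],
                   columns=['a', 'b'])
df2 = pd.DataFrame([{'a': 1, 'c': 5}, {'a': 2, 'c': 6}],
                   columns=['a', 'c'])

# Merge df1's 'a' column against df2's index; each duplicate key in df1
# picks up the single matching value from df2
df3 = pd.merge(df1, df2.set_index('a'), left_on='a', right_index=True)
print(df3)
```

The duplicate rows of df1 are preserved, each paired with the df2 value for its key.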
Upvotes: 0
Reputation: 393963
You can perform a merge and set the parameters so that the index is used on both the left-hand and right-hand side:
In [4]:
df1.merge(df2, left_index=True, right_index=True)
Out[4]:
b c
a
1 2 5
1 3 5
2 4 6
[3 rows x 2 columns]
concat should also work; it works for me:
In [5]:
pd.concat([df1,df2], join='inner', axis=1)
Out[5]:
b c
a
1 2 5
1 3 5
2 4 6
[3 rows x 2 columns]
Upvotes: 8