Concatenate pandas dataframes with varying rows per index

Question

I have two dataframes df1 and df2 with key as index.

import pandas as pd

dict_1 = {'key': [1, 1, 1, 2, 2, 3], 'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}
df1 = pd.DataFrame(dict_1).set_index('key')
dict_2 = {'key': [1, 1, 2], 'col2': ['a2', 'b2', 'c2']}
df2 = pd.DataFrame(dict_2).set_index('key')

df1:

       col1
key
1       a1
1       b1
1       c1
2       d1
2       e1
3       f1

df2:

       col2
key
1       a2
1       b2
2       c2

Note that the number of rows per index value differs between the two dataframes. I want to concatenate them so that I get the following dataframe (say df3).

df3

       col1 col2
key 
1       a1   a2
1       b1   b2
2       d1   c2

i.e. concatenate the two columns so that, for each index value, the new dataframe has the smaller of the row counts in df1 and df2.

I tried

pd.concat([df1,df2],axis=1)

but I get the following error:

ValueError: Shape of passed values is (2,17), indices imply (2,7)

My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?

ptrj · Accepted Answer

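A plain merge or join on key alone pairs every df1 row with every df2 row that shares the key, so you get a lot of duplicates that are hard to get rid of. But a little trick will help. First, number the rows within each key: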

df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df1
Out[198]: 
    col1  count1
key             
1     a1       1
1     b1       2
1     c1       3
2     d1       1
2     e1       2
3     f1       1

The same thing for df2:

df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()
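As a possible shortcut, the same per-key counter can probably be built in one step with GroupBy.cumcount(), which numbers the rows of each group starting from 0. A minimal sketch, assuming the df1 from the question; the names ones, via_cumsum and via_cumcount are only for illustration, and the + 1 is there just to match the numbering of the cumsum version:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')

# Counter built as in the answer: cumulative sum over a column of ones, per key
ones = pd.Series(1, index=df1.index)
via_cumsum = ones.groupby(level=0).cumsum()

# cumcount numbers the rows of each key 0, 1, 2, ...; add 1 to match the above
via_cumcount = df1.groupby(level=0).cumcount() + 1

# Compare the raw values of the two counters
same = bool((via_cumsum.to_numpy() == via_cumcount.to_numpy()).all())
print(same)
```

Either counter works for the merge that follows; starting at 0 or 1 makes no difference as long as both dataframes use the same convention.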

And finally:

df_aligned = df1.reset_index().merge(df2.reset_index(), left_on = ['key','count1'], right_on = ['key', 'count2'])
df_aligned
Out[199]: 
   key col1  count1 col2  count2
0    1   a1       1   a2       1
1    1   b1       2   b2       2
2    2   d1       1   c2       1

Now you can restore the index with set_index('key') and drop the count1 and count2 columns, which are no longer needed.
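Put together, the steps above, including the final cleanup, might look like the sketch below. It assumes the df1 and df2 from the question and uses drop(columns=...) for the cleanup, which is one of several ways to remove the helper columns:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

# Running counter within each key, for both dataframes
df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()

# Inner merge on (key, counter): only positions present on both sides survive
df_aligned = df1.reset_index().merge(
    df2.reset_index(),
    left_on=['key', 'count1'],
    right_on=['key', 'count2'],
)

# Drop the helper columns and put key back as the index
df3 = df_aligned.drop(columns=['count1', 'count2']).set_index('key')
print(df3)
```

Because merge defaults to an inner join, the third row for key 1, the second row for key 2 and all of key 3 in df1 have no counterpart in df2 and are dropped, which gives the smaller row count per key that the question asks for.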
