Bhushan

Reputation: 610

Concatenate pandas dataframes with varying rows per index

I have two dataframes df1 and df2 with key as index.

import pandas as pd

dict_1 = {'key': [1, 1, 1, 2, 2, 3], 'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}
df1 = pd.DataFrame(dict_1).set_index('key')
dict_2 = {'key': [1, 1, 2], 'col2': ['a2', 'b2', 'c2']}
df2 = pd.DataFrame(dict_2).set_index('key')

df1:

       col1
key
1       a1
1       b1
1       c1
2       d1
2       e1
3       f1

df2:

       col2
key 
1      a2
1      b2
2      c2

Note that the two dataframes have a different number of rows for each index value. I want to concatenate them so that I get the following dataframe (say df3):

df3

       col1 col2
key 
1       a1   a2
1       b1   b2
2       d1   c2

i.e. concatenate the two columns so that, for each index value, the new dataframe has the smaller of the two row counts from df1 and df2.

I tried

pd.concat([df1,df2],axis=1)

but I get the following error:

ValueError: Shape of passed values is (2,17), indices imply (2,7)

My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?

Upvotes: 2

Views: 751

Answers (2)

ptrj

Reputation: 5222

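Merge/join alone will give you a lot of duplicates that are hard to get rid of, because every row of df1 is paired with every row of df2 that has the same key. A little trick helps: number the rows within each key, then merge on the key and that number together.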

df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df1
Out[198]: 
    col1  count1
key             
1     a1       1
1     b1       2
1     c1       3
2     d1       1
2     e1       2
3     f1       1

The same thing for df2:

df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()

And finally:

df_aligned = df1.reset_index().merge(df2.reset_index(), left_on = ['key','count1'], right_on = ['key', 'count2'])
df_aligned
Out[199]: 
   key col1  count1 col2  count2
0    1   a1       1   a2       1
1    1   b1       2   b2       2
2    2   d1       1   c2       1

Now you can restore the index with set_index('key') and drop the count1 and count2 columns, which are no longer needed.
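The steps above can be condensed into one short sketch. It swaps the cumsum trick for groupby(...).cumcount(), which numbers rows from 0 instead of 1 but serves the same purpose. The helper column name pos is an assumption chosen for illustration; it is not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

# Number the rows within each key (0, 1, 2, ...) so that (key, pos) is unique.
# 'pos' is a hypothetical helper column name.
left = df1.assign(pos=df1.groupby(level=0).cumcount().to_numpy()).reset_index()
right = df2.assign(pos=df2.groupby(level=0).cumcount().to_numpy()).reset_index()

# An inner merge keeps only the (key, pos) pairs that exist in both frames,
# then the helper column is dropped and the original index is restored.
df3 = (left.merge(right, on=['key', 'pos'], how='inner')
           .drop(columns='pos')
           .set_index('key'))
print(df3)
```

Because both helper columns share one name, the merge can use on= rather than separate left_on/right_on lists, and only a single column has to be dropped afterwards.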

Upvotes: 2

DataSwede

Reputation: 5591

The main reason you can't line the two up the way you want is that your keys are duplicated. How are you going to line up the a1 value in df1 with the a2 value in df2 when a1, a2, b1, b2, and c1 all have the same key?

Using merge is what you'll want if you can resolve the key issues:

df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')

You can pass 'inner', 'outer', 'left' or 'right' as the how argument.
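To see why the duplicated keys matter, here is a small sketch of what the plain index merge returns on the data from the question. It is only an illustration of the behaviour, not a fix: each df1 row is paired with every df2 row that shares its key.

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

# With duplicated index values, the inner merge builds every combination
# of matching rows per key: 3 * 2 rows for key 1 and 2 * 1 rows for key 2.
merged = df1.merge(df2, left_index=True, right_index=True, how='inner')

print(len(merged))
print(merged)
```

The result has 8 rows rather than the 3 rows of the desired df3, so the keys have to be made unique (for example with a per-key row number) before merging.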

Upvotes: 0
