Reputation: 91
I am trying to combine two dataframes which both contain a column of repeated values but not the same number of repeats.
import pandas as pd
df1 = pd.DataFrame({'col1':[1, 1, 2, 2, 3, 3, 3],
'col2':[1.1, 1.3, 2.1, 2.3, 3.1, 3.3, 3.5]})
df2 = pd.DataFrame({'col1':[1, 2, 2, 3, 3, 3],
'col2':[1.2, 2.2, 2.4, 3.2, 3.4, 3.6]})
df1
col1 col2
0 1 1.1
1 1 1.3
2 2 2.1
3 2 2.3
4 3 3.1
5 3 3.3
6 3 3.5
df2
col1 col2
0 1 1.2
1 2 2.2
2 2 2.4
3 3 3.2
4 3 3.4
5 3 3.6
The desired output would be for example:
desired_result = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 3, 3],
'col2_x':[1.1, 1.3, 2.1, 2.3, 3.1, 3.3, 3.5],
'col2_y':[1.2, 'NaN' , 2.2, 2.4, 3.2, 3.4, 3.6]})
desired_result
col1 col2_x col2_y
0 1 1.1 1.2
1 1 1.3 NaN
2 2 2.1 2.2
3 2 2.3 2.4
4 3 3.1 3.2
5 3 3.3 3.4
6 3 3.5 3.6
The problem is the ambiguity in how to combine the two dataframes on col1 which contains repeated values and a direct matching is not possible (and also not necessary).
Upvotes: 1
Views: 77
Reputation: 76297
You can groupby
and horizontal-concat
each of the groups. After that, it's just some column/index manipulation:
In [75]: merged = df1.groupby(df1.col1).apply(lambda g: pd.concat([g[['col2']].reset_index(), df2[['col2']][df2.col1 == g.col1.values[0]].reset_index()], axis=1))
In [76]: merged.columns = ['_', 'col2_x', '__', 'col2_y']
In [77]: merged.reset_index()[['col1', 'col2_x', 'col2_y']]
Out[77]:
col1 col2_x col2_y
0 1 1.1 1.2
1 1 1.3 NaN
2 2 2.1 2.2
3 2 2.3 2.4
4 3 3.1 3.2
5 3 3.3 3.4
6 3 3.5 3.6
Upvotes: 1