Reputation: 277
I have two pandas data frames. The first is:
df1 = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"]})
The second data frame is:
df2 = pd.DataFrame({"val1" : ["A1","A1","A1","B2","B2","B2"],
"val2" : [10, 13, 16, 11, 20, 22]})
I would like to merge the two together in a way in which the row ordering from df1 is used and the values from df2 follow this ordering. Ideally, I would like it to look like this:
df_final = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"],
"val2" : [11, 10, 20, 13, 22, 16]})
I've tried using the merge function with left_on and right_on, but I don't get the output I'm looking for. Any help would be greatly appreciated.
Upvotes: 4
Views: 381
Reputation: 879461
You could use groupby/cumcount
to assign a unique number to each row within each group:
df1['cumcount'] = df1.groupby('val1').cumcount()
# val1 cumcount
# 0 B2 0
# 1 A1 0
# 2 B2 1
# 3 A1 1
# 4 B2 2
# 5 A1 2
If we do the same to df2
:
df2['cumcount'] = df2.groupby('val1').cumcount()
# val1 val2 cumcount
# 0 A1 10 0
# 1 A1 13 1
# 2 A1 16 2
# 3 B2 11 0
# 4 B2 20 1
# 5 B2 22 2
then merging df1
with df2
on the common columns (val1
and cumcount
) produces the desired result:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"]})
df2 = pd.DataFrame({"val1" : ["A1","A1","A1","B2","B2","B2"],
"val2" : [10, 13, 16, 11, 20, 22]})
df_final = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"],
"val2" : [11, 10, 20, 13, 22, 16]})
df1['cumcount'] = df1.groupby('val1').cumcount()
df2['cumcount'] = df2.groupby('val1').cumcount()
result = pd.merge(df1, df2, how='left')
result = result.drop('cumcount', axis=1)
print(result)
assert result.equals(df_final)
yields
val1 val2
0 B2 11
1 A1 10
2 B2 20
3 A1 13
4 B2 22
5 A1 16
Note that merging with how='left'
produces a result with the same number of rows as the first DataFrame, df1
, and maintains the same order of rows as df1
.
Upvotes: 0
Reputation: 210832
You can do it this way:
df2
by ['val1', 'val2']
, group it by val1
and store it as g2
?idx
column to df1
which will be used in order to pick values from df2
Code:
In [176]: df1['idx'] = 1
In [177]: df1['idx'] = df1.groupby('val1')['idx'].cumsum()-1
In [178]: df1
Out[178]:
val1 idx
0 B2 0
1 A1 0
2 B2 1
3 A1 1
4 B2 2
5 A1 2
In [179]: g2 = df2.sort_values(['val1', 'val2']).groupby('val1')
In [180]: g2.groups
Out[180]: {'A1': [0, 1, 2], 'B2': [3, 4, 5]}
In [181]: df2.iloc[g2.groups['A1'][1]]
Out[181]:
val1 A1
val2 13
Name: 1, dtype: object
In [182]: df1.apply(lambda x: df2.iloc[g2.groups[x['val1']][x['idx']]], axis=1)
Out[182]:
val1 val2
0 B2 11
1 A1 10
2 B2 20
3 A1 13
4 B2 22
5 A1 16
Upvotes: 1