Reputation: 2024
I have a two pandas DataFrame df1
and df2
. The relationship between them is one-to-many, and in some instances it can be one-to-one. When the relationship is one-to-many, I'd like to join columns with certain conditions. I'll illustrate with some data.
import pandas as pd
df1 = pd.DataFrame({
'vid': [1, 2, 3, 4, 5],
'lid': [6, 7, 8, 9, 10],
'v': [3, 5, 6, 1, 9]
})
df2 = pd.DataFrame({
'lid': [6, 6, 8, 8, 10],
'av': ['$10','$5','$4','$3','$2'],
'cr': [0.04, 0.05, 0.03, 0.04, 0.01]
})
For rows where there are multiple joins in df2
i.e lid
6
and 8
, I'd like to apply some function say, get the max
of av
and cr
.
Expected output:
vid lid v av cr
1 6 3 $10 0.05
2 7 5 np.nan np.nan
3 8 6 $5 0.04
4 9 1 np.nan np.nan
5 10 9 $2 0.01
Upvotes: 1
Views: 76
Reputation: 863531
For match by max or by min by both columns create helper column tmp
and join new DataFrame created by sorting per columns lid
and tmp
with remove duplicates per lid
:
df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))
#sorting by ascending and desceding for match by maximal of tuple in col tmp
df = (df1.merge(df2.sort_values(['lid','tmp'], ascending=[True, False])
.drop_duplicates('lid'), how='left', on='lid')
.drop('tmp', axis=1))
print (df)
vid lid v av cr
0 1 6 3 $10 0.04
1 2 7 5 NaN NaN
2 3 8 6 $4 0.03
3 4 9 1 NaN NaN
4 5 10 9 $2 0.01
df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))
#sorting both ascending for match by minimal of tuple in col tmp
df = (df1.merge(df2.sort_values(['lid','tmp'])
.drop_duplicates('lid'), how='left', on='lid')
.drop('tmp', axis=1))
print (df)
vid lid v av cr
0 1 6 3 $5 0.05
1 2 7 5 NaN NaN
2 3 8 6 $3 0.04
3 4 9 1 NaN NaN
4 5 10 9 $2 0.01
EDIT: If aggregate max
or mean
aggregation working for each column separately, so ouput is different like solutions above:
df2['tmp'] = df2['av'].str.strip('$').astype(int)
df = df1.merge(df2.groupby('lid').max(), how='left', on='lid')
print (df)
vid lid v av cr tmp
0 1 6 3 $5 0.05 10.0
1 2 7 5 NaN NaN NaN
2 3 8 6 $4 0.04 4.0
3 4 9 1 NaN NaN NaN
4 5 10 9 $2 0.01 2.0
df = df1.merge(df2.groupby('lid').mean(), how='left', on='lid')
print (df)
vid lid v cr tmp
0 1 6 3 0.045 7.5
1 2 7 5 NaN NaN
2 3 8 6 0.035 3.5
3 4 9 1 NaN NaN
4 5 10 9 0.010 2.0
Upvotes: 1