kms
kms

Reputation: 2024

Merge/Join pandas dataframe with condition

I have a two pandas DataFrame df1 and df2. The relationship between them is one-to-many, and in some instances it can be one-to-one. When the relationship is one-to-many, I'd like to join columns with certain conditions. I'll illustrate with some data.

import pandas as pd

df1 = pd.DataFrame({
                    'vid': [1, 2, 3, 4, 5],
                    'lid': [6, 7, 8, 9, 10],
                    'v': [3, 5, 6, 1, 9]
                  })

df2 = pd.DataFrame({
                    'lid': [6, 6, 8, 8, 10],
                    'av': ['$10','$5','$4','$3','$2'],
                    'cr': [0.04, 0.05, 0.03, 0.04, 0.01]
                  })

For rows where there are multiple joins in df2 i.e lid 6 and 8, I'd like to apply some function say, get the max of av and cr.

Expected output:

vid lid  v  av      cr
1    6   3  $10     0.05
2    7   5  np.nan  np.nan
3    8   6  $5      0.04
4    9   1  np.nan  np.nan
5    10  9  $2      0.01

Upvotes: 1

Views: 76

Answers (1)

jezrael
jezrael

Reputation: 863531

For match by max or by min by both columns create helper column tmp and join new DataFrame created by sorting per columns lid and tmp with remove duplicates per lid:

df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))

#sorting by ascending and desceding for match by maximal of tuple in col tmp
df = (df1.merge(df2.sort_values(['lid','tmp'], ascending=[True, False])
                   .drop_duplicates('lid'), how='left', on='lid')
                   .drop('tmp', axis=1))
print (df)
   vid  lid  v   av    cr
0    1    6  3  $10  0.04
1    2    7  5  NaN   NaN
2    3    8  6   $4  0.03
3    4    9  1  NaN   NaN
4    5   10  9   $2  0.01

df2['tmp'] = list(zip(df2['av'].str.strip('$').astype(int), df2['cr']))

#sorting both ascending for match by minimal of tuple in col tmp
df = (df1.merge(df2.sort_values(['lid','tmp'])
                   .drop_duplicates('lid'), how='left', on='lid')
                   .drop('tmp', axis=1))
print (df)
   vid  lid  v   av    cr
0    1    6  3   $5  0.05
1    2    7  5  NaN   NaN
2    3    8  6   $3  0.04
3    4    9  1  NaN   NaN
4    5   10  9   $2  0.01

EDIT: If aggregate max or mean aggregation working for each column separately, so ouput is different like solutions above:

df2['tmp'] = df2['av'].str.strip('$').astype(int)

df = df1.merge(df2.groupby('lid').max(), how='left', on='lid')
print (df)
   vid  lid  v   av    cr   tmp
0    1    6  3   $5  0.05  10.0
1    2    7  5  NaN   NaN   NaN
2    3    8  6   $4  0.04   4.0
3    4    9  1  NaN   NaN   NaN
4    5   10  9   $2  0.01   2.0

df = df1.merge(df2.groupby('lid').mean(), how='left', on='lid')
print (df)
   vid  lid  v     cr  tmp
0    1    6  3  0.045  7.5
1    2    7  5    NaN  NaN
2    3    8  6  0.035  3.5
3    4    9  1    NaN  NaN
4    5   10  9  0.010  2.0

Upvotes: 1

Related Questions