Hummer
Hummer

Reputation: 445

Merging two dataframes with a common and uncommon keys in Pandas

I have the below dataframes that I would like to merge. Merge df2 to df1.

df1  = {'ID1': ['A12', 'A13', 'A14','A15'], 'ArrivalDateTime': ["2021-09-20", "2021-09-21", "2021-09-26","2022-01-20"]} 
df2  = {'ID': ['A12', 'A12', '003','A15','A15'], 'ArrivalDateCap': ["2021-09-17", "2021-09-21", "2021-09-20","2022-01-17","2022-01-19"]}  
  
df1 = pd.DataFrame(df1)  
df1["ArrivalDateTime"] = pd.to_datetime(df1["ArrivalDateTime"],format="%Y-%m-%d %H:%M") 

df2 = pd.DataFrame(df2)
df2["ArrivalDateCap"] = pd.to_datetime(df2["ArrivalDateCap"],format="%Y-%m-%d %H:%M")

The common columns are ID1 in df1 and ID in df2. In addition, if they match(left join df1 and df2), I need to look for ArrivalDateCap less than but closest to the value in ArrivalDateTime for that record. That's the record from df2 that should be picked if more than one matching records exist. Sample output should be as below:

output_df  = {'ID': ['A12', 'A13', 'A14','A15'], 'ArrivalDateTime': ["2021-09-20", "2021-09-21", "2021-09-26","2022-01-20"],'ID1':['A12','A13','A14','A15'],'ArrivalDateCap':['2021-09-17',np.NaN,np.NaN,'2022-01-19']}  
output_df = pd.DataFrame(output_df)  

Can I use pandas.merge_asof for this? If so, how can I define the ID1 and ID as well as ArrivalDateTime and ArrivalDateCap for this to work?

Upvotes: 0

Views: 34

Answers (1)

jezrael
jezrael

Reputation: 862921

Use merge_asof first:

df = pd.merge_asof(df1, 
                   df2.sort_values('ArrivalDateCap'), 
                   left_by='ID1', 
                   right_by='ID', 
                   left_on='ArrivalDateTime', 
                   right_on='ArrivalDateCap')
print (df)
   ID1 ArrivalDateTime   ID ArrivalDateCap
0  A12      2021-09-20  A12     2021-09-17
1  A13      2021-09-21  NaN            NaT
2  A14      2021-09-26  NaN            NaT
3  A15      2022-01-20  A15     2022-01-19

And then for replace missing values use Series.fillna:

df['ID'] = df['ID'].fillna(df['ID1'])
print (df)
   ID1 ArrivalDateTime   ID ArrivalDateCap
0  A12      2021-09-20  A12     2021-09-17
1  A13      2021-09-21  A13            NaT
2  A14      2021-09-26  A14            NaT
3  A15      2022-01-20  A15     2022-01-19

Upvotes: 1

Related Questions