Reputation: 1
I have two pandas dataframes containing IDs and birth dates. The birth dates in dataframe 1 are only estimates, so to capture as many matches as possible I want to pair each ID in dataframe 1 with every ID in dataframe 2 whose birth date is within 45 days of the birth date in dataframe 1. Example as follows:
df_A:
ID birth_date
A_2 1-1-1945
A_3 1-1-1967
A_4 1-1-2000
A_5 1-1-1994
A_6 2-1-1994
df_B:
ID birth_date
B_1 2-2-1945
B_2 12-25-1944
B_3 1-5-2000
B_4 1-7-1994
Expected Output:
ID_1 ID_2
A_2 B_1
A_2 B_2
A_4 B_3
A_5 B_4
A_6 B_4
I edited in more examples to show fully what I would like to receive.
Upvotes: 0
Views: 42
Reputation: 1
I think I figured it out myself, using the answer I found here: cartesian product in pandas
My code looks like this now:
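import pandas as pd
# Assumes the columns are already named df_A_ID, df_A_birth_date, df_B_ID, df_B_birth_date and the dates are datetimes
# Add a constant key so the merge pairs every row of df_A with every row of df_B
df_A.loc[:, 'key'] = 1
df_B.loc[:, 'key'] = 1
cart = pd.merge(df_A, df_B, on='key')[['df_A_ID', 'df_A_birth_date', 'df_B_ID', 'df_B_birth_date']]
# Keep only the pairs whose birth dates are less than 45 days apart
window = pd.to_timedelta(45, unit='d')
cart = cart[(cart['df_B_birth_date'] < cart['df_A_birth_date'] + window)
            & (cart['df_B_birth_date'] > cart['df_A_birth_date'] - window)].sort_values(['df_A_ID']).reset_index(drop=True)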
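For anyone who wants to try the cross-join idea on the sample data from the question, here is a minimal sketch. It rests on a few assumptions of mine: the dates are in month-day-year order, "within 45 days" is read as inclusive, and pandas is 1.2 or newer so that `how="cross"` is available in place of the helper `key` column. The frame and column names follow the question rather than the code above.

```python
import pandas as pd

# Sample data copied from the question
df_A = pd.DataFrame({
    "ID": ["A_2", "A_3", "A_4", "A_5", "A_6"],
    "birth_date": ["1-1-1945", "1-1-1967", "1-1-2000", "1-1-1994", "2-1-1994"],
})
df_B = pd.DataFrame({
    "ID": ["B_1", "B_2", "B_3", "B_4"],
    "birth_date": ["2-2-1945", "12-25-1944", "1-5-2000", "1-7-1994"],
})

# Assumption: month-day-year order (suggested by 12-25-1944)
for df in (df_A, df_B):
    df["birth_date"] = pd.to_datetime(df["birth_date"], format="%m-%d-%Y")

# Pair every row of df_A with every row of df_B (needs pandas >= 1.2)
pairs = df_A.merge(df_B, how="cross", suffixes=("_1", "_2"))

# Keep pairs whose birth dates differ by at most 45 days (inclusive by assumption)
gap = (pairs["birth_date_1"] - pairs["birth_date_2"]).abs()
matches = (
    pairs.loc[gap <= pd.Timedelta(days=45), ["ID_1", "ID_2"]]
    .sort_values(["ID_1", "ID_2"])
    .reset_index(drop=True)
)
print(matches)
```

On this sample it yields the five pairs listed under "Expected Output" in the question. Keep in mind that a cross join builds len(df_A) * len(df_B) rows before filtering, so it may get heavy on large frames.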
Upvotes: 0
Reputation: 75100
This is merge_asof with tolerance:
pd.merge_asof(df2.sort_values('birth_date'), df1.sort_values('birth_date'),
              on='birth_date', tolerance=pd.Timedelta(45, unit='D'), direction='nearest',
              suffixes=('_2', '_1')).drop(columns='birth_date')
ID_2 ID_1
0 B_2 A_2
1 B_1 A_2
2 B_3 A_4
Note, please convert the birth_date columns to datetime first:
df1.birth_date=pd.to_datetime(df1.birth_date)
df2.birth_date=pd.to_datetime(df2.birth_date)
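To see how this behaves on the extended sample data in the question, here is a self-contained sketch. The month-day-year date format is my assumption, and `df1`/`df2` stand in for `df_A`/`df_B`. One thing worth knowing: `merge_asof` keeps at most one match per left row (the nearest one), so it does not return every pair that falls inside the 45-day window.

```python
import pandas as pd

# df1 / df2 correspond to df_A / df_B in the question
df1 = pd.DataFrame({
    "ID": ["A_2", "A_3", "A_4", "A_5", "A_6"],
    "birth_date": ["1-1-1945", "1-1-1967", "1-1-2000", "1-1-1994", "2-1-1994"],
})
df2 = pd.DataFrame({
    "ID": ["B_1", "B_2", "B_3", "B_4"],
    "birth_date": ["2-2-1945", "12-25-1944", "1-5-2000", "1-7-1994"],
})

# Assumption: month-day-year order (suggested by 12-25-1944)
df1["birth_date"] = pd.to_datetime(df1["birth_date"], format="%m-%d-%Y")
df2["birth_date"] = pd.to_datetime(df2["birth_date"], format="%m-%d-%Y")

# Both inputs must be sorted on the key; each df2 row gets its nearest df1 row
# if that row lies within 45 days, otherwise NaN
result = pd.merge_asof(
    df2.sort_values("birth_date"),
    df1.sort_values("birth_date"),
    on="birth_date",
    tolerance=pd.Timedelta(days=45),
    direction="nearest",
    suffixes=("_2", "_1"),
).drop(columns="birth_date")
print(result)
```

Here B_4 is paired only with A_5, its nearest neighbour, so the A_6 / B_4 pair from the expected output is missing. If every pair inside the window is needed, a cross join followed by a filter is probably the safer route.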
Upvotes: 2