Reputation: 426
I have two Dataframes in the form:
Dataframe(df1):
P_CLIENT_ID | P_DATE_ENCOUNTER |
---|---|
25835 | 2016-12-21 |
25835 | 2017-02-21 |
25835 | 2017-04-25 |
25835 | 2017-06-21 |
25835 | 2017-09-04 |
25835 | 2018-01-08 |
25835 | 2018-04-03 |
Dataframe(df2):
R_CLIENT_ID | R_DATE_TESTED | R_RESULT |
---|---|---|
25835 | 2017-03-07 | 20.0 |
25835 | 2017-08-03 | 20.0 |
25835 | 2018-03-23 | 20.0 |
25835 | 2019-06-28 | 20.0 |
25835 | 2019-08-19 | 42.0 |
25835 | 2020-04-20 | 40.0 |
25835 | 2021-06-03 | 20.0 |
I want to merge df2 onto df1 (main table) with the join key being P_CLIENT_ID
and R_CLIENT_ID
appending the most recent R_DATE_TESTED
and R_RESULT
First Condition:
If R_DATE_TESTED > P_DATE_ENCOUNTER
then nullify the R_DATE_TESTED, R_RESULT
fields.
Second Condition:
If R_DATE_TESTED < P_DATE_ENCOUNTER
then appned the most recent R_DATE_TESTED, R_RESULT
fields to the dataframe with the final results as:
The result from logic should be as below:
P_CLIENT_ID | R_CLIENT_ID | P_DATE_ENCOUNTER | R_DATE_TESTED | R_RESULT |
---|---|---|---|---|
25835 | 25835.0 | 2016-12-21 | NaN | NaN |
25835 | 25835.0 | 2017-02-21 | NaN | NaN |
25835 | 25835.0 | 2017-04-25 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-06-21 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-09-04 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-01-08 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-04-03 | 2018-03-23 | 20.0 |
Note: The actual dataset is quite large: df1 ~ 700000 rows and df2 ~ 125000 rows
Code attempt
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'P_CLIENT_D': ['25835','25835','25835','25835','25835','25835','25835'],
'P_DATE_ENCOUNTER': ['2016-12-21','2017-02-21','2017-04-25','2017-06-21','2017-09-04','2018-01-08','2018-04-03']})
df2 = pd.DataFrame({'R_CLIENT_ID': ['25835','25835','25835','25835','25835','25835','25835'],
'R_DATE_TESTED': ['2017-03-07','2017-08-03','2018-03-23','2019-06-28','2019-08-19','2020-04-20','2021-06-03'],
'R_RESULT':[20,20,20,20,42,40,20]})
df_merged = pd.merge(df1, df2, left_on=['P_CLIENT_D'], right_on = ['R_CLIENT_ID'], how='left')
df_merged = df_merged.drop_duplicates(subset=['P_CLIENT_D', 'P_DATE_ENCOUNTER'], keep='last')
df_merged['FLAG_LAB_AFTER_VISIT'] = 0
df_merged.loc[df_merged.R_DATE_TESTED >= df_merged.P_DATE_ENCOUNTER,'FLAG_LAB_AFTER_VISIT']=1
print(df_merged['FLAG_LAB_AFTER_VISIT'].sum(), 'future labs set to null')
#now the rows with flags - set all lab fields to null
df_merged.loc[df_merged['FLAG_LAB_AFTER_VISIT']==1, df2.columns] = np.nan
Upvotes: 0
Views: 1061
Reputation: 23146
Try with pandas.merge_asof
:
>>> pd.merge_asof(df1,
df2,
left_on="P_DATE_ENCOUNTER",
right_on="R_DATE_TESTED",
left_by="P_CLIENT_ID",
right_by="R_CLIENT_ID")
P_CLIENT_ID P_DATE_ENCOUNTER R_CLIENT_ID R_DATE_TESTED R_RESULT
0 25835 2016-12-21 NaN NaT NaN
1 25835 2017-02-21 NaN NaT NaN
2 25835 2017-04-25 25835.0 2017-03-07 20.0
3 25835 2017-06-21 25835.0 2017-03-07 20.0
4 25835 2017-09-04 25835.0 2017-08-03 20.0
5 25835 2018-01-08 25835.0 2017-08-03 20.0
6 25835 2018-04-03 25835.0 2018-03-23 20.0
Upvotes: 1