Reputation: 117
I have two dataframes and they both have a datetime column.
df1 = pd.DataFrame({"participant":{"0":1,"1":1,"2":1,"3":2,"4":3},"createdOn":{"0":"2020-04-09 19:11:49","1":"2020-04-09 19:12:36","2":"2020-04-09 19:13:12","3":"2020-04-09 19:17:36","4":"2020-04-09 19:21:16"},"id_sample":{"0":20,"1":21,"2":22,"3":23,"4":24}," med":{"0":True,"1":True,"2":True,"3":True,"4":False}})
df2 = pd.DataFrame({"participant":{"0":1,"1":1,"2":1,"3":3,"4":2},"createdOn":{"0":"2020-04-14 19:11:49","1":"2020-04-09 19:12:36","2":"2020-04-09 19:13:12","3":"2020-04-09 19:17:36","4":"2020-04-03 19:21:16"},"first":{"0":34.1,"1":54.2,"2":33.2,"3":54,"4":59.7}})
df1
participant createdOn id_sample med
1 2020-04-09 19:11:49 20 True
1 2020-04-09 19:12:36 21 True
1 2020-04-09 19:13:12 22 True
2 2020-04-09 19:17:36 23 True
3 2020-04-09 19:21:16 24 False
df2
participant createdOn first
1 2020-04-14 19:11:49 34.1
1 2020-04-09 19:12:36 54.2
1 2020-04-09 19:13:12 33.2
3 2020-04-09 19:17:36 54.0
2 2020-04-03 19:21:16 59.7
I want to remove from df1 those results the date of which createdOn
is "far" apart from any row from df2 of the same participant
where the difference in the date is +/- 5 days.
In this case from df1 the row with participant
2 will be dropped because the time difference from df2 is more than 5 days.
any help would be highly appreciated.
Upvotes: 2
Views: 607
Reputation: 41327
Perform an outer merge
between df1
and df2[['participant', 'createdOn']]
:
df = df1.merge(df2[['participant', 'createdOn']], on='participant', how='outer', suffixes=('','2'))
# participant createdOn med createdOn2
# 0 1 2020-04-09 19:11:49 True 2020-04-14 19:11:49
# 1 1 2020-04-09 19:11:49 True 2020-04-09 19:13:12
# 2 1 2020-04-09 19:12:36 True 2020-04-14 19:11:49
# 3 1 2020-04-09 19:12:36 True 2020-04-09 19:13:12
# 4 1 2020-04-09 19:13:12 True 2020-04-14 19:11:49
# 5 1 2020-04-09 19:13:12 True 2020-04-09 19:13:12
# 6 2 2020-04-09 19:17:36 True 2020-04-03 19:21:16
# 7 3 2020-04-09 19:21:16 False 2020-04-09 19:12:36
# 8 3 2020-04-09 19:21:16 False 2020-04-09 19:17:36
Generate a mask
where the abs
value of the createdOn
difference is within (le
) the threshold (5 days):
threshold = pd.Timedelta(days=5)
mask = df.createdOn2.sub(df.createdOn).abs().le(threshold)
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# 7 True
# 8 True
# dtype: bool
Apply the mask
and drop_duplicates
:
df.loc[mask, df1.columns].drop_duplicates()
# participant createdOn med
# 1 2020-04-09 19:11:49 True
# 1 2020-04-09 19:12:36 True
# 1 2020-04-09 19:13:12 True
# 3 2020-04-09 19:21:16 False
Upvotes: 4