luftgekuhltlover
luftgekuhltlover

Reputation: 117

How to remove rows for which the date difference is bigger or smaller than

I have two dataframes and they both have a datetime column.

df1 = pd.DataFrame({"participant":{"0":1,"1":1,"2":1,"3":2,"4":3},"createdOn":{"0":"2020-04-09 19:11:49","1":"2020-04-09 19:12:36","2":"2020-04-09 19:13:12","3":"2020-04-09 19:17:36","4":"2020-04-09 19:21:16"},"id_sample":{"0":20,"1":21,"2":22,"3":23,"4":24}," med":{"0":True,"1":True,"2":True,"3":True,"4":False}})
df2 = pd.DataFrame({"participant":{"0":1,"1":1,"2":1,"3":3,"4":2},"createdOn":{"0":"2020-04-14 19:11:49","1":"2020-04-09 19:12:36","2":"2020-04-09 19:13:12","3":"2020-04-09 19:17:36","4":"2020-04-03 19:21:16"},"first":{"0":34.1,"1":54.2,"2":33.2,"3":54,"4":59.7}})
 

df1

participant createdOn           id_sample   med
1           2020-04-09 19:11:49    20       True
1           2020-04-09 19:12:36    21       True
1           2020-04-09 19:13:12    22       True
2           2020-04-09 19:17:36    23       True
3           2020-04-09 19:21:16    24       False


df2

participant   createdOn             first
    1         2020-04-14 19:11:49   34.1
    1         2020-04-09 19:12:36   54.2
    1         2020-04-09 19:13:12   33.2
    3         2020-04-09 19:17:36   54.0
    2         2020-04-03 19:21:16   59.7

I want to remove from df1 those results the date of which createdOn is "far" apart from any row from df2 of the same participant where the difference in the date is +/- 5 days. In this case from df1 the row with participant 2 will be dropped because the time difference from df2 is more than 5 days.

any help would be highly appreciated.

Upvotes: 2

Views: 607

Answers (1)

tdy
tdy

Reputation: 41327

  1. Perform an outer merge between df1 and df2[['participant', 'createdOn']]:

    df = df1.merge(df2[['participant', 'createdOn']], on='participant', how='outer', suffixes=('','2'))
    
    #    participant           createdOn    med          createdOn2
    # 0            1 2020-04-09 19:11:49   True 2020-04-14 19:11:49
    # 1            1 2020-04-09 19:11:49   True 2020-04-09 19:13:12
    # 2            1 2020-04-09 19:12:36   True 2020-04-14 19:11:49
    # 3            1 2020-04-09 19:12:36   True 2020-04-09 19:13:12
    # 4            1 2020-04-09 19:13:12   True 2020-04-14 19:11:49
    # 5            1 2020-04-09 19:13:12   True 2020-04-09 19:13:12
    # 6            2 2020-04-09 19:17:36   True 2020-04-03 19:21:16
    # 7            3 2020-04-09 19:21:16  False 2020-04-09 19:12:36
    # 8            3 2020-04-09 19:21:16  False 2020-04-09 19:17:36
    
  2. Generate a mask where the abs value of the createdOn difference is within (le) the threshold (5 days):

    threshold = pd.Timedelta(days=5)
    mask = df.createdOn2.sub(df.createdOn).abs().le(threshold)
    
    # 0     True
    # 1     True
    # 2     True
    # 3     True
    # 4     True
    # 5     True
    # 6    False
    # 7     True
    # 8     True
    # dtype: bool
    
  3. Apply the mask and drop_duplicates:

    df.loc[mask, df1.columns].drop_duplicates()
    
    # participant           createdOn    med
    #           1 2020-04-09 19:11:49   True
    #           1 2020-04-09 19:12:36   True
    #           1 2020-04-09 19:13:12   True
    #           3 2020-04-09 19:21:16  False
    

Upvotes: 4

Related Questions