Merge nearest preceding date and between dates

Question

I have 2 dfs that look like this:

df1 <- data.frame(ID = c("1", "1", "1", "2", "2", "2"),
                  testdate = as.POSIXct(c("2010-3-20", "2018-04-12","2018-04-25","2011-04-17","2011-09-05","2019-04-16")),
                  testvalue = (c(17, 35, 44, 65, 21, 22)))

df2 <- data.frame(ID = c("1", "1", "2", "2", "2"),
                  begindate = as.POSIXct(c("2018-04-10","2018-04-30","2011-04-12","2011-07-15","2018-01-21")),
                  enddate = as.POSIXct(c("2018-04-22","2018-05-12","2011-04-30","2011-07-30","2018-01-29")),
                  Dose = (c("2x per day", "1x per day", "1x morning", "2x morning", "3x per day")))

Df1 has testvalues of subjects on a certain day. Df2 contains the prescriptions a subject had between a certain period.

I would like to merge the two dfs by ID and if testdate falls between df2.begindate and df2.enddate, or if testdate has a preceding "prescription" in df2, then I would like the nearest "prescription" (seen in row 3 of new df).

The final df should look like this, I would also like to keep all data in df1.

   ID   testdate   testvalue  begindate   enddate       dose
1   1   2010-03-20  17        NA          NA            NA
2   1   2018-04-12  35        2018-04-10  2018-04-22    2x per day
3   1   2018-04-25  44        2018-04-10  2018-04-22    2x per day
4   2   2011-04-17  65        2011-04-12  2011-04-30    1x morning
5   2   2011-09-05  21        2011-07-15  2011-07-30    2x morning
6   2   2019-04-16  22        2018-01-21  2018-01-29    3x per day

I have tried this: Find nearest preceding and following dates between data frames, but without success. I keep getting multiple rows not only containing the nearest "prescription" but all from before a certain testdate, which is not what I would like.

EDIT: I have tried this:

setDT(df1)
setDT(df2)

setkey(df1, ID, testdate)
setkey(df2, ID, begindate)[, PrecedingDate:=begindate]

result <- df2[df1, roll=Inf]

But this does not work for row 3 in new df and adjust my testvalue date, which I Would like to have in the df.

G. Grothendieck · Accepted Answer

1) For each row this finds the interval containing testvalue that has the largest begindate and the same ID or if none it finds the interval with the largest begindate no larger than testvalue having the same ID.

First add a row number seq to df1 creating temporary table df1s and then left join each row in df1s to the row in df2 that contains its testvalue and has the same ID and greatest begindate. Also it creates temporary table df1b which finds the greatest begindate prior to the testdate and having the same ID. Finally it joins df1a and df1b on seq taking the begindate, enddate and Dose from df1a if these exist and from df2 if not.

library(sqldf)

sqldf("with df1s as (
  select rowid as seq, * from df1
),
df1a as (  -- nearest preceding containing interval having same ID
  select max(b.begindate) as begindate, a.*, b.begindate, b.enddate, b.Dose
  from df1s a
  left join df2 b on a.ID = b.ID and a.testdate between b.begindate and b.enddate
  group by a.seq),
df1b as (  -- nearest preceding begindate having same ID
  select max(b.begindate), a.*, b.begindate, b.enddate, b.Dose
    from df1s a 
    left join df2 b on a.ID = b.ID and b.begindate <= a.testdate
    group by a.seq)
-- pick out interval in df1a or if none in df1b
select a.ID, a.testdate, a.testvalue, 
    coalesce(a.begindate, b.begindate) as begindate,
    coalesce(a.enddate, b.enddate) as enddate,
    coalesce(a.Dose, b.Dose) as Dose
  from df1a a 
  left join df1b b on a.seq = b.seq")

giving the following using the revised data in the question:

  ID   testdate testvalue  begindate    enddate       Dose
1  1 2010-03-20        17                     
2  1 2018-04-12        35 2018-04-10 2018-04-22 2x per day
3  1 2018-04-25        44 2018-04-10 2018-04-22 2x per day
4  2 2011-04-17        65 2011-04-12 2011-04-30 1x morning
5  2 2011-09-05        21 2011-07-15 2011-07-30 2x morning
6  2 2019-04-16        22 2018-01-21 2018-01-29 3x per day

2) If we reduce (1) to just df1b there we get a much shorter solution though obviously not equivalent. It simply takes the largest begindate in df2 that is no larger than the testdate in df1 and has the same ID. It is possible that it will prefer a match which does not span testvalue even when there is an interval that does in the case that there is an interval that contains the testvalue but another interval is contained in that interval and ends before testvalue; however, other than that it should be ok. Use (1) if it is not ok.

library(sqldf)

sqldf("select a.*, max(b.begindate) as begindate, b.enddate, b.Dose
  from df1 a
  left join df2 b on a.ID = b.ID and b.begindate <= a.testdate
  group by a.rowid")

giving the following using the revised data in the question:

  ID   testdate testvalue  begindate    enddate       Dose
1  1 2010-03-20        17                     
2  1 2018-04-12        35 2018-04-10 2018-04-22 2x per day
3  1 2018-04-25        44 2018-04-10 2018-04-22 2x per day
4  2 2011-04-17        65 2011-04-12 2011-04-30 1x morning
5  2 2011-09-05        21 2011-07-15 2011-07-30 2x morning
6  2 2019-04-16        22 2018-01-21 2018-01-29 3x per day

Merge nearest preceding date and between dates

Answers (2)

Related Questions