Casey
Casey

Reputation: 127

R: Finding the closest of two dates according to an index column

The purpose of this project is to measure the time interval between a medical lab check and the most recent dose of a medication. Each patient has a different number of subsequent doses of this medication, and the number of follow-up lab checks are also different for each subject.

The first data frame contains study_id and respective dose dates:

library(dplyr)
library(lubridate)

study_id<- c(1, 1, 1, 2, 2, 3)
dose_dt <- c('1/1/00', '2/1/00', '3/1/00', '1/1/01', '2/1/01', '1/1/02')
doses_df <- data.frame(study_id, dose_dt)
doses_df$dose_dt <- mdy(doses_df$dose_dt)
print(doses_df)

 study_id    dose_dt
1        1 2000-01-01
2        1 2000-02-01
3        1 2000-03-01
4        2 2001-01-01
5        2 2001-02-01
6        3 2002-01-01

The second data frame has matching study_id, this time with lab check dates and associated lab value and an indicator of abnormal (yes/no).

study_id <- c(1, 1, 1, 2, 3, 3, 3)
lab_dt <- c('1/1/99', '3/1/00', '4/1/00', '2/1/01', '2/1/02', '3/1/02', '4/1/02')
lab_result <- c(100, 200, 50, 25, 75, 100, 75)
lab_abn_yn <- c(0, 0, 1, 1, 1, 0, 1)
labs_df <- data.frame(study_id, lab_dt, lab_result, lab_abn_yn)
labs_df$lab_dt <- mdy(labs_df$lab_dt)
print(labs_df)

  study_id     lab_dt lab_result lab_abn_yn
1        1 1999-01-01        100          0
2        1 2000-03-01        200          0
3        1 2000-04-01         50          1
4        2 2001-02-01         25          1
5        3 2002-02-01         75          1
6        3 2002-03-01        100          0
7        3 2002-04-01         75          1

Notice that subject 1 has one lab check that is in the past prior to the first medication dose, subject 2 has more doses than lab checks, and subject 3 has fewer doses than lab checks.

I want R to determine the date of the most recent dose of the medication prior to the lab check so that I can calculate the interval between dose and lab check. The output would preserve the lab values and indicators. Preferably, lab checks prior to the first dose (negative time interval from dose to lab check) would be reported as NA, but I can easily filter out negative time intervals. I also know how to use lubridate to calculate time intervals, so need to add that to the explanation.

Desired output:

  study_id lab_dt     most_recent_dose_dt lab_result lab_abn_yn interval_months
     <dbl> <chr>      <chr>               <chr>           <dbl>           <dbl>
1        1 1999-01-01 NA                  NA                 NA          NA    
2        1 2000-03-01 2000-02-02          200                 0           0.966
3        1 2000-04-01 2000-03-01          50                  1           1    
4        2 2001-02-01 2001-01-01          25                  1           1    
5        3 2002-02-01 2002-01-01          75                  1           1    
6        3 2002-03-01 2002-01-01          100                 0           2    
7        3 2002-04-01 2002-01-01          75                  1           3      

I've tried a number of merging schemes, but none preserve all data. There are ~40,000 subjects, so doing this by hand isn't feasible. Any help much appreciated.

Upvotes: 2

Views: 176

Answers (2)

denis
denis

Reputation: 5673

There is a one line solution with data.table, using non-equi joins:

library(data.table)
# create data.tables
labs_df <-  setDT(labs_df)
doses_df <- setDT(doses_df)

# create join variable
doses_df[,join_time := dose_dt]
labs_df[,join_time := lab_dt]

# do nonequi join with a condition
doses_df[labs_df,on=.(study_id,join_time < join_time),mult = "last"]

   study_id    dose_dt  join_time     lab_dt lab_result lab_abn_yn
1:        1       <NA> 1999-01-01 1999-01-01        100          0
2:        1 2000-02-01 2000-03-01 2000-03-01        200          0
3:        1 2000-03-01 2000-04-01 2000-04-01         50          1
4:        2 2001-01-01 2001-02-01 2001-02-01         25          1
5:        3 2002-01-01 2002-02-01 2002-02-01         75          1
6:        3 2002-01-01 2002-03-01 2002-03-01        100          0
7:        3 2002-01-01 2002-04-01 2002-04-01         75          1

Here the idea is that you are merging doses_df on the study_id and join_time of labs_df that respect the condition join_time from the dose_df < joint_time from the labs_df.

I create a join_time column because the join only keep one of the two time column otherwise and changes names, so I always get confuse: if you do directly

doses_df[labs_df,on=.(study_id,dose_dt < lab_dt),mult = "last"]

It gives you

   study_id    dose_dt lab_result lab_abn_yn
1:        1 1999-01-01        100          0
2:        1 2000-03-01        200          0
3:        1 2000-04-01         50          1
4:        2 2001-02-01         25          1
5:        3 2002-02-01         75          1
6:        3 2002-03-01        100          0
7:        3 2002-04-01         75          1

which is right for the lab_result and other column, but confusing for the dose_dt column because it becomes the lab_dt column you did the merge on (the merge is like sub-setting the doses_dt column on the lab_dt values).

I actually wanted to use rolling joins in the beginning :

doses_df[labs_df,on=.(study_id,join_time),roll = T]

   study_id    dose_dt  join_time     lab_dt lab_result lab_abn_yn
1:        1       <NA> 1999-01-01 1999-01-01        100          0
2:        1 2000-03-01 2000-03-01 2000-03-01        200          0
3:        1 2000-03-01 2000-04-01 2000-04-01         50          1
4:        2 2001-02-01 2001-02-01 2001-02-01         25          1
5:        3 2002-01-01 2002-02-01 2002-02-01         75          1
6:        3 2002-01-01 2002-03-01 2002-03-01        100          0
7:        3 2002-01-01 2002-04-01 2002-04-01         75          1

but the problem is that it keeps date equal or inferior to.

I used this question to find the equi-joins solution, and I recommend this tutorial for the rolling joins. data.table is fast, and allow you to do in one line what you actually want (take the last line in the merge that comply with dose_dt < lab_dt).

Upvotes: 3

Calum You
Calum You

Reputation: 15082

We can do this in three steps:

  1. Filter to keep only doses that happened after the lab date
  2. Filter to keep only the dose that was the most recent for each lab date (which is now the most recent dose thanks to the first filter)
  3. Join back onto the list of labs to re-include lab dates that had no prior doses

Note that your sample data isn't quite as printed in the desired out, in that Feb 1st became Feb 2nd for the second observation.

library(tidyverse)
library(lubridate)
doses_df <- tibble(
  study_id = c(1, 1, 1, 2, 2, 3),
  dose_dt = mdy(c("1/1/00", "2/1/00", "3/1/00", "1/1/01", "2/1/01", "1/1/02"))
)
labs_df <- tibble(
  study_id = c(1, 1, 1, 2, 3, 3, 3),
  lab_dt = mdy(c("1/1/99", "3/1/00", "4/1/00", "2/1/01", "2/1/02", "3/1/02", "4/1/02")),
  lab_result = c(100, 200, 50, 25, 75, 100, 75),
  lab_abn_yn = c(0, 0, 1, 1, 1, 0, 1)
)

most_recent_doses <- labs_df %>%
  left_join(doses_df, by = "study_id") %>%
  group_by(study_id, lab_dt) %>%
  filter(dose_dt < lab_dt) %>%
  filter(dose_dt == max(dose_dt)) %>%
  select(study_id, lab_dt, dose_dt)

labs_df %>%
  left_join(most_recent_doses, by = c("study_id", "lab_dt")) %>%
  mutate(interval_months = interval(dose_dt, lab_dt) / months(1))
#> # A tibble: 7 x 6
#>   study_id lab_dt     lab_result lab_abn_yn dose_dt    interval_months
#>      <dbl> <date>          <dbl>      <dbl> <date>               <dbl>
#> 1        1 1999-01-01        100          0 NA                      NA
#> 2        1 2000-03-01        200          0 2000-02-01               1
#> 3        1 2000-04-01         50          1 2000-03-01               1
#> 4        2 2001-02-01         25          1 2001-01-01               1
#> 5        3 2002-02-01         75          1 2002-01-01               1
#> 6        3 2002-03-01        100          0 2002-01-01               2
#> 7        3 2002-04-01         75          1 2002-01-01               3

Created on 2019-10-16 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions