R: Finding the closest of two dates according to an index column

Question

The purpose of this project is to measure the time interval between a medical lab check and the most recent dose of a medication. Each patient has a different number of subsequent doses of this medication, and the number of follow-up lab checks are also different for each subject.

The first data frame contains study_id and respective dose dates:

library(dplyr)
library(lubridate)

study_id<- c(1, 1, 1, 2, 2, 3)
dose_dt <- c('1/1/00', '2/1/00', '3/1/00', '1/1/01', '2/1/01', '1/1/02')
doses_df <- data.frame(study_id, dose_dt)
doses_df$dose_dt <- mdy(doses_df$dose_dt)
print(doses_df)

 study_id    dose_dt
1        1 2000-01-01
2        1 2000-02-01
3        1 2000-03-01
4        2 2001-01-01
5        2 2001-02-01
6        3 2002-01-01

The second data frame has matching study_id, this time with lab check dates and associated lab value and an indicator of abnormal (yes/no).

study_id <- c(1, 1, 1, 2, 3, 3, 3)
lab_dt <- c('1/1/99', '3/1/00', '4/1/00', '2/1/01', '2/1/02', '3/1/02', '4/1/02')
lab_result <- c(100, 200, 50, 25, 75, 100, 75)
lab_abn_yn <- c(0, 0, 1, 1, 1, 0, 1)
labs_df <- data.frame(study_id, lab_dt, lab_result, lab_abn_yn)
labs_df$lab_dt <- mdy(labs_df$lab_dt)
print(labs_df)

  study_id     lab_dt lab_result lab_abn_yn
1        1 1999-01-01        100          0
2        1 2000-03-01        200          0
3        1 2000-04-01         50          1
4        2 2001-02-01         25          1
5        3 2002-02-01         75          1
6        3 2002-03-01        100          0
7        3 2002-04-01         75          1

Notice that subject 1 has one lab check that is in the past prior to the first medication dose, subject 2 has more doses than lab checks, and subject 3 has fewer doses than lab checks.

I want R to determine the date of the most recent dose of the medication prior to the lab check so that I can calculate the interval between dose and lab check. The output would preserve the lab values and indicators. Preferably, lab checks prior to the first dose (negative time interval from dose to lab check) would be reported as NA, but I can easily filter out negative time intervals. I also know how to use lubridate to calculate time intervals, so need to add that to the explanation.

Desired output:

  study_id lab_dt     most_recent_dose_dt lab_result lab_abn_yn interval_months
                                                 
1        1 1999-01-01 NA                  NA                 NA          NA    
2        1 2000-03-01 2000-02-02          200                 0           0.966
3        1 2000-04-01 2000-03-01          50                  1           1    
4        2 2001-02-01 2001-01-01          25                  1           1    
5        3 2002-02-01 2002-01-01          75                  1           1    
6        3 2002-03-01 2002-01-01          100                 0           2    
7        3 2002-04-01 2002-01-01          75                  1           3

I've tried a number of merging schemes, but none preserve all data. There are ~40,000 subjects, so doing this by hand isn't feasible. Any help much appreciated.

Calum You · Accepted Answer

We can do this in three steps:

Filter to keep only doses that happened after the lab date
Filter to keep only the dose that was the most recent for each lab date (which is now the most recent dose thanks to the first filter)
Join back onto the list of labs to re-include lab dates that had no prior doses

Note that your sample data isn't quite as printed in the desired out, in that Feb 1st became Feb 2nd for the second observation.

library(tidyverse)
library(lubridate)
doses_df <- tibble(
  study_id = c(1, 1, 1, 2, 2, 3),
  dose_dt = mdy(c("1/1/00", "2/1/00", "3/1/00", "1/1/01", "2/1/01", "1/1/02"))
)
labs_df <- tibble(
  study_id = c(1, 1, 1, 2, 3, 3, 3),
  lab_dt = mdy(c("1/1/99", "3/1/00", "4/1/00", "2/1/01", "2/1/02", "3/1/02", "4/1/02")),
  lab_result = c(100, 200, 50, 25, 75, 100, 75),
  lab_abn_yn = c(0, 0, 1, 1, 1, 0, 1)
)

most_recent_doses <- labs_df %>%
  left_join(doses_df, by = "study_id") %>%
  group_by(study_id, lab_dt) %>%
  filter(dose_dt < lab_dt) %>%
  filter(dose_dt == max(dose_dt)) %>%
  select(study_id, lab_dt, dose_dt)

labs_df %>%
  left_join(most_recent_doses, by = c("study_id", "lab_dt")) %>%
  mutate(interval_months = interval(dose_dt, lab_dt) / months(1))
#> # A tibble: 7 x 6
#>   study_id lab_dt     lab_result lab_abn_yn dose_dt    interval_months
#>                                       
#> 1        1 1999-01-01        100          0 NA                      NA
#> 2        1 2000-03-01        200          0 2000-02-01               1
#> 3        1 2000-04-01         50          1 2000-03-01               1
#> 4        2 2001-02-01         25          1 2001-01-01               1
#> 5        3 2002-02-01         75          1 2002-01-01               1
#> 6        3 2002-03-01        100          0 2002-01-01               2
#> 7        3 2002-04-01         75          1 2002-01-01               3

^{Created on 2019-10-16 by the reprex package (v0.3.0)}

R: Finding the closest of two dates according to an index column

Answers (2)

Related Questions