Reputation: 127
The purpose of this project is to measure the time interval between a medical lab check and the most recent dose of a medication. Each patient has a different number of subsequent doses of this medication, and the number of follow-up lab checks are also different for each subject.
The first data frame contains study_id and respective dose dates:
library(dplyr)
library(lubridate)
study_id<- c(1, 1, 1, 2, 2, 3)
dose_dt <- c('1/1/00', '2/1/00', '3/1/00', '1/1/01', '2/1/01', '1/1/02')
doses_df <- data.frame(study_id, dose_dt)
doses_df$dose_dt <- mdy(doses_df$dose_dt)
print(doses_df)
study_id dose_dt
1 1 2000-01-01
2 1 2000-02-01
3 1 2000-03-01
4 2 2001-01-01
5 2 2001-02-01
6 3 2002-01-01
The second data frame has matching study_id, this time with lab check dates and associated lab value and an indicator of abnormal (yes/no).
study_id <- c(1, 1, 1, 2, 3, 3, 3)
lab_dt <- c('1/1/99', '3/1/00', '4/1/00', '2/1/01', '2/1/02', '3/1/02', '4/1/02')
lab_result <- c(100, 200, 50, 25, 75, 100, 75)
lab_abn_yn <- c(0, 0, 1, 1, 1, 0, 1)
labs_df <- data.frame(study_id, lab_dt, lab_result, lab_abn_yn)
labs_df$lab_dt <- mdy(labs_df$lab_dt)
print(labs_df)
study_id lab_dt lab_result lab_abn_yn
1 1 1999-01-01 100 0
2 1 2000-03-01 200 0
3 1 2000-04-01 50 1
4 2 2001-02-01 25 1
5 3 2002-02-01 75 1
6 3 2002-03-01 100 0
7 3 2002-04-01 75 1
Notice that subject 1 has one lab check that is in the past prior to the first medication dose, subject 2 has more doses than lab checks, and subject 3 has fewer doses than lab checks.
I want R to determine the date of the most recent dose of the medication prior to the lab check so that I can calculate the interval between dose and lab check. The output would preserve the lab values and indicators. Preferably, lab checks prior to the first dose (negative time interval from dose to lab check) would be reported as NA, but I can easily filter out negative time intervals. I also know how to use lubridate to calculate time intervals, so need to add that to the explanation.
Desired output:
study_id lab_dt most_recent_dose_dt lab_result lab_abn_yn interval_months
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 1999-01-01 NA NA NA NA
2 1 2000-03-01 2000-02-02 200 0 0.966
3 1 2000-04-01 2000-03-01 50 1 1
4 2 2001-02-01 2001-01-01 25 1 1
5 3 2002-02-01 2002-01-01 75 1 1
6 3 2002-03-01 2002-01-01 100 0 2
7 3 2002-04-01 2002-01-01 75 1 3
I've tried a number of merging schemes, but none preserve all data. There are ~40,000 subjects, so doing this by hand isn't feasible. Any help much appreciated.
Upvotes: 2
Views: 176
Reputation: 5673
There is a one line solution with data.table
, using non-equi
joins:
library(data.table)
# create data.tables
labs_df <- setDT(labs_df)
doses_df <- setDT(doses_df)
# create join variable
doses_df[,join_time := dose_dt]
labs_df[,join_time := lab_dt]
# do nonequi join with a condition
doses_df[labs_df,on=.(study_id,join_time < join_time),mult = "last"]
study_id dose_dt join_time lab_dt lab_result lab_abn_yn
1: 1 <NA> 1999-01-01 1999-01-01 100 0
2: 1 2000-02-01 2000-03-01 2000-03-01 200 0
3: 1 2000-03-01 2000-04-01 2000-04-01 50 1
4: 2 2001-01-01 2001-02-01 2001-02-01 25 1
5: 3 2002-01-01 2002-02-01 2002-02-01 75 1
6: 3 2002-01-01 2002-03-01 2002-03-01 100 0
7: 3 2002-01-01 2002-04-01 2002-04-01 75 1
Here the idea is that you are merging doses_df on the study_id
and join_time
of labs_df
that respect the condition join_time
from the dose_df < joint_time
from the labs_df.
I create a join_time
column because the join only keep one of the two time column otherwise and changes names, so I always get confuse: if you do directly
doses_df[labs_df,on=.(study_id,dose_dt < lab_dt),mult = "last"]
It gives you
study_id dose_dt lab_result lab_abn_yn
1: 1 1999-01-01 100 0
2: 1 2000-03-01 200 0
3: 1 2000-04-01 50 1
4: 2 2001-02-01 25 1
5: 3 2002-02-01 75 1
6: 3 2002-03-01 100 0
7: 3 2002-04-01 75 1
which is right for the lab_result
and other column, but confusing for the dose_dt
column because it becomes the lab_dt
column you did the merge on (the merge is like sub-setting the doses_dt
column on the lab_dt
values).
I actually wanted to use rolling joins in the beginning :
doses_df[labs_df,on=.(study_id,join_time),roll = T]
study_id dose_dt join_time lab_dt lab_result lab_abn_yn
1: 1 <NA> 1999-01-01 1999-01-01 100 0
2: 1 2000-03-01 2000-03-01 2000-03-01 200 0
3: 1 2000-03-01 2000-04-01 2000-04-01 50 1
4: 2 2001-02-01 2001-02-01 2001-02-01 25 1
5: 3 2002-01-01 2002-02-01 2002-02-01 75 1
6: 3 2002-01-01 2002-03-01 2002-03-01 100 0
7: 3 2002-01-01 2002-04-01 2002-04-01 75 1
but the problem is that it keeps date equal or inferior to.
I used this question to find the equi-joins solution, and I recommend this tutorial for the rolling joins.
data.table
is fast, and allow you to do in one line what you actually want (take the last line in the merge that comply with dose_dt < lab_dt
).
Upvotes: 3
Reputation: 15082
We can do this in three steps:
Note that your sample data isn't quite as printed in the desired out, in that Feb 1st became Feb 2nd for the second observation.
library(tidyverse)
library(lubridate)
doses_df <- tibble(
study_id = c(1, 1, 1, 2, 2, 3),
dose_dt = mdy(c("1/1/00", "2/1/00", "3/1/00", "1/1/01", "2/1/01", "1/1/02"))
)
labs_df <- tibble(
study_id = c(1, 1, 1, 2, 3, 3, 3),
lab_dt = mdy(c("1/1/99", "3/1/00", "4/1/00", "2/1/01", "2/1/02", "3/1/02", "4/1/02")),
lab_result = c(100, 200, 50, 25, 75, 100, 75),
lab_abn_yn = c(0, 0, 1, 1, 1, 0, 1)
)
most_recent_doses <- labs_df %>%
left_join(doses_df, by = "study_id") %>%
group_by(study_id, lab_dt) %>%
filter(dose_dt < lab_dt) %>%
filter(dose_dt == max(dose_dt)) %>%
select(study_id, lab_dt, dose_dt)
labs_df %>%
left_join(most_recent_doses, by = c("study_id", "lab_dt")) %>%
mutate(interval_months = interval(dose_dt, lab_dt) / months(1))
#> # A tibble: 7 x 6
#> study_id lab_dt lab_result lab_abn_yn dose_dt interval_months
#> <dbl> <date> <dbl> <dbl> <date> <dbl>
#> 1 1 1999-01-01 100 0 NA NA
#> 2 1 2000-03-01 200 0 2000-02-01 1
#> 3 1 2000-04-01 50 1 2000-03-01 1
#> 4 2 2001-02-01 25 1 2001-01-01 1
#> 5 3 2002-02-01 75 1 2002-01-01 1
#> 6 3 2002-03-01 100 0 2002-01-01 2
#> 7 3 2002-04-01 75 1 2002-01-01 3
Created on 2019-10-16 by the reprex package (v0.3.0)
Upvotes: 1