Reputation: 15
I have two tibbles, df1 and df2, and I want to build df_temp from them using dplyr operations. The use case is implementing time-varying covariates in a survival model with delayed entry, where start_time is age. Does anyone have a solution using dplyr or tmerge?
library(dplyr)
library(magrittr)
library(survival)
df1 <- tibble(
  id = c(1, 2, 3),
  start_time = c(5, 10, 15),
  stop_time = c(8, 17, 25),
  event = c(1, 1, 0)
)
df2 <- tibble(
  id = c(1, 2, 3),
  stop_time_cancer = c(6, NA, 20),
  cancer_status = c(1, 0, 1)
)
df_temp <- tibble(
  id = c(1, 1, 2, 3, 3),
  start_time = c(5, 6, 10, 15, 20),
  stop_time = c(6, 8, 17, 20, 25),
  cancer_event = c(0, 1, 0, 0, 1),
  event = c(0, 1, 1, 0, 0)
)
Thanks!
I tried doing it using the tmerge function, but since I have delayed entry, I couldn't get it to work.
Upvotes: 0
Views: 283
Reputation: 160492
This currently uses fuzzyjoin for the non-equi-join mechanics (required based on my interpretation of the problem-set). When dplyr-1.1.0 is released, this can likely be done with its join_by functionality (ref: https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/#join-improvements).
# library(fuzzyjoin)
out <- fuzzyjoin::fuzzy_left_join(
  df1, df2,
  by = c(id = "id", start_time = "stop_time_cancer", stop_time = "stop_time_cancer"),
  match_fun = list(`==`, `<=`, `>=`)
) %>%
  rowwise() %>%
  summarize(
    id = id.x,
    start_time = c(start_time, na.omit(stop_time_cancer)),
    stop_time = sort(c(na.omit(stop_time_cancer), stop_time)),
    event = c(if (!is.na(stop_time_cancer)) 0, event),
    cancer_event = c(0, if (!is.na(stop_time_cancer)) 1)
  )
out
# # A tibble: 5 × 5
#      id start_time stop_time event cancer_event
#   <dbl>      <dbl>     <dbl> <dbl>        <dbl>
# 1     1          5         6     0            0
# 2     1          6         8     1            1
# 3     2         10        17     1            0
# 4     3         15        20     0            0
# 5     3         20        25     0            1
Verification:
all.equal(df_temp, out[,names(df_temp)])
# [1] TRUE
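The join_by idea mentioned above can be sketched roughly as follows now that dplyr 1.1.0 is out. This is only a sketch under a few assumptions: dplyr >= 1.1.0 is installed, the name out2 is made up for illustration, and reframe() is swapped in for summarize() because it is the verb intended for results with more than one row per group. The interval-splitting logic is otherwise the same as in the fuzzyjoin version.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, needed for join_by() and reframe()

# Same inputs as in the question, repeated so the sketch runs on its own.
df1 <- tibble(id = c(1, 2, 3), start_time = c(5, 10, 15),
              stop_time = c(8, 17, 25), event = c(1, 1, 0))
df2 <- tibble(id = c(1, 2, 3), stop_time_cancer = c(6, NA, 20),
              cancer_status = c(1, 0, 1))

out2 <- df1 %>%
  # Non-equi join: same id, and the cancer time falls inside [start_time, stop_time].
  # Rows of df1 without a match (including NA cancer times) are kept with NA.
  left_join(
    df2,
    by = join_by(id, start_time <= stop_time_cancer, stop_time >= stop_time_cancer)
  ) %>%
  rowwise() %>%
  # Split each interval at the cancer time, if there is one.
  reframe(
    id = id,
    start_time = c(start_time, na.omit(stop_time_cancer)),
    stop_time = sort(c(na.omit(stop_time_cancer), stop_time)),
    event = c(if (!is.na(stop_time_cancer)) 0, event),
    cancer_event = c(0, if (!is.na(stop_time_cancer)) 1)
  )
```

Because id is an equality key here, the joined column stays named id rather than id.x. The result can be checked against df_temp the same way as in the verification above.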
Upvotes: 3