Reputation: 35
Using the example dataset, the summary for this problem is for each id, for each in_visit within id, create a unique id_2 for all out_visit instances that are within 365 days prior of in_visit date. In the example dataset for id=1, there is a single in_visit date, but multiple out_visit dates. Each out_visit date may (or may not) have codes in one of the 4 columns c1 - c4. No codes appear for in_visit dates. I would like to format this data to long format such that all of the codes that correspond to out_visits within 365 days prior to the in_visit, are put into a single column, and assigned a unique id_2. When there is more than a single in_visit date, as for id=2, then some codes in the c1 - c4 columns may be duplicated since they may be 365 days prior to more than a single in_visit. For example, the data for id=2, out_visit=10/6/2020, would appear twice for the same id but have two different id_2 values, in the final dataset. Thus, for every in_visit date for each id that has an in_visit date, there would be a unique id_2 for the final dataset. But, these might share some of the same data from columns c1-c4. I have tried to several methods but could not seem to create the unique id_2.
#initial dataset
ID <- c("1","1","1","1","1","1","1","2","2","2","2","2","2","2","2","2","3","3",
"3","3","3","3","3","4","4","4","4","4","4","4","5","5","6")
out_visit <- c("","2021-03-25","2021-06-01","2021-01-01","2020-10-02",
"2020-09-12","2020-02-06","","2021-04-25","2021-06-01","",
"2021-01-01","2020-10-06","","2020-09-12","2019-02-06","",
"2021-04-02","2021-08-01","2021-01-01","2020-10-02","2020-09-12",
"2020-02-06","","2014-03-25","2015-06-01","2014-01-01","2018-10-02",
"2014-09-12","2019-02-06","2020-06-05","2020-06-24","")
in_visit <- c("2021-03-17","","","","","","","2021-03-01","","","2020-11-02","",
"","2020-09-12","","","2017-08-03","","","","","","","2021-03-17",
"","","","","","","","","2021-02-02")
c1 <- c("","e23","e45","d55","r44","","r44","","e23","e45","","d55","r44","","",
"r44","","e23","r44","q22","r44","w3","r44","","y6","i88","","r44","",
"u77","y66","u77","")
c2 <- c("","","d44","c33","t55","","","","","d44","","c33","t55","","","","",
"e24","d44","c33","t55","","","","","","","","","","","","")
c3 <- c("","","","e22","y55","","","","","","","e22","y55","","","","","e25","",
"e22","y55","","","","","","","","","","","","")
c4 <- c("","","","","y66","","","","","","","","y66","","","","","e26","","e23",
"y66","","","","","","","","","","","","")
df1 <- data.frame(ID,out_visit,in_visit,c1,c2,c3,c4)
#final dataset
ID <- c("1","1","1","1","1","1","1","2","2","2","2","2","2","2","2","2","2","2",
"2","3","4","6")
ID_2 <- c("1","1","1","1","1","1","1","2","2","2","2","2","2","2","3","3","3",
"3","4","5","6","7")
date <- c("2021-03-17","2021-03-17","2021-03-17","2021-03-17","2021-03-17",
"2021-03-17","2021-03-17","2021-03-01","2021-03-01","2021-03-01",
"2021-03-01","2021-03-01","2021-03-01","2021-03-01","2020-11-02",
"2020-11-02","2020-11-02","2020-11-02","2020-09-12","2017-08-03",
"2021-03-17","2021-02-02")
code <- c("d55","c33","e22","r44","t55","y55","y66","d55","c33","e22","r44",
"t55","y55","y66","r44","t55","y55","y66","","","","")
df2 <- data.frame(ID,ID_2,date,code)
Upvotes: 0
Views: 59
Reputation: 30474
I'm not entirely sure this captures what you are looking for, but hope this may be helpful. I wrote this out using additional steps initially for clarity.
I separated your in_visit
information from out_visit
with its associated codes (c1-c4); now they are in two data.frames.
The in_visit
data.frame can have a date range (in_visit
date and the date - 365 days).
The out_visit
data.frame is put into long form with pivot_longer
.
Then, using fuzzy_left_join
from the fuzzyjoin
package, you can join the two data.frames based on the ID
and out_visit
dates that fit within your in_visit
date range.
A separate id_2
was created based on unique out_visit
dates for a given ID
. It was unclear if multiple out_visit
could exist for a given ID
.
library(tidyverse)
library(fuzzyjoin)
library(data.table)
in_df1 <- df1 %>%
filter(in_visit != "") %>%
mutate(in_visit = as.Date(in_visit)) %>%
mutate(in_visit_365 = in_visit - 365) %>%
select(ID, in_visit, in_visit_365)
out_df1 <- df1 %>%
filter(out_visit != "") %>%
select(-in_visit) %>%
pivot_longer(cols = starts_with("c")) %>%
filter(value != "")
fuzzy_left_join(
in_df1,
out_df1,
by = c("ID" = "ID", "in_visit" = "out_visit", "in_visit_365" = "out_visit"),
match_fun = c(`==`, `>=`, `<=`)
) %>%
filter(!is.na(value)) %>%
mutate(id_2 = rleid(ID.x, in_visit)) %>%
select(ID.x, id_2, in_visit, value)
Output
ID.x id_2 in_visit value
1 1 1 2021-03-17 d55
2 1 1 2021-03-17 c33
3 1 1 2021-03-17 e22
4 1 1 2021-03-17 r44
5 1 1 2021-03-17 t55
6 1 1 2021-03-17 y55
7 1 1 2021-03-17 y66
8 2 2 2021-03-01 d55
9 2 2 2021-03-01 c33
10 2 2 2021-03-01 e22
11 2 2 2021-03-01 r44
12 2 2 2021-03-01 t55
13 2 2 2021-03-01 y55
14 2 2 2021-03-01 y66
15 2 3 2020-11-02 r44
16 2 3 2020-11-02 t55
17 2 3 2020-11-02 y55
18 2 3 2020-11-02 y66
Edit: Here is a data.table
alternative that might work faster. This would substitute for the fuzzy_left_join
.
setDT(in_df1, ID)
setDT(out_df1, ID)
out_df1$out_visit <- as.Date(out_df1$out_visit)
out_df1[in_df1,
.(ID, out_visit, value),
on = .(ID, out_visit <= in_visit, out_visit >= in_visit_365),
nomatch = 0L][
, ID_2 := rleid(ID, out_visit)][]
Output
ID out_visit value ID_2
1: 1 2021-03-17 d55 1
2: 1 2021-03-17 c33 1
3: 1 2021-03-17 e22 1
4: 1 2021-03-17 r44 1
5: 1 2021-03-17 t55 1
6: 1 2021-03-17 y55 1
7: 1 2021-03-17 y66 1
8: 2 2021-03-01 d55 2
9: 2 2021-03-01 c33 2
10: 2 2021-03-01 e22 2
11: 2 2021-03-01 r44 2
12: 2 2021-03-01 t55 2
13: 2 2021-03-01 y55 2
14: 2 2021-03-01 y66 2
15: 2 2020-11-02 r44 3
16: 2 2020-11-02 t55 3
17: 2 2020-11-02 y55 3
18: 2 2020-11-02 y66 3
Upvotes: 1