TexasMed
TexasMed

Reputation: 35

Grouping based on an id and specific timeframe taken from multiple columns, and converting to long format

Using the example dataset, the summary for this problem is for each id, for each in_visit within id, create a unique id_2 for all out_visit instances that are within 365 days prior of in_visit date. In the example dataset for id=1, there is a single in_visit date, but multiple out_visit dates. Each out_visit date may (or may not) have codes in one of the 4 columns c1 - c4. No codes appear for in_visit dates. I would like to format this data to long format such that all of the codes that correspond to out_visits within 365 days prior to the in_visit, are put into a single column, and assigned a unique id_2. When there is more than a single in_visit date, as for id=2, then some codes in the c1 - c4 columns may be duplicated since they may be 365 days prior to more than a single in_visit. For example, the data for id=2, out_visit=10/6/2020, would appear twice for the same id but have two different id_2 values, in the final dataset. Thus, for every in_visit date for each id that has an in_visit date, there would be a unique id_2 for the final dataset. But, these might share some of the same data from columns c1-c4. I have tried to several methods but could not seem to create the unique id_2.

#initial dataset
ID <- c("1","1","1","1","1","1","1","2","2","2","2","2","2","2","2","2","3","3",
        "3","3","3","3","3","4","4","4","4","4","4","4","5","5","6")
out_visit <- c("","2021-03-25","2021-06-01","2021-01-01","2020-10-02",
               "2020-09-12","2020-02-06","","2021-04-25","2021-06-01","",
               "2021-01-01","2020-10-06","","2020-09-12","2019-02-06","",
               "2021-04-02","2021-08-01","2021-01-01","2020-10-02","2020-09-12",
               "2020-02-06","","2014-03-25","2015-06-01","2014-01-01","2018-10-02",
               "2014-09-12","2019-02-06","2020-06-05","2020-06-24","")
in_visit <- c("2021-03-17","","","","","","","2021-03-01","","","2020-11-02","",
              "","2020-09-12","","","2017-08-03","","","","","","","2021-03-17",
              "","","","","","","","","2021-02-02")
c1 <- c("","e23","e45","d55","r44","","r44","","e23","e45","","d55","r44","","",
        "r44","","e23","r44","q22","r44","w3","r44","","y6","i88","","r44","",
        "u77","y66","u77","")
c2 <- c("","","d44","c33","t55","","","","","d44","","c33","t55","","","","",
        "e24","d44","c33","t55","","","","","","","","","","","","")
c3 <- c("","","","e22","y55","","","","","","","e22","y55","","","","","e25","",
        "e22","y55","","","","","","","","","","","","")
c4 <- c("","","","","y66","","","","","","","","y66","","","","","e26","","e23",
        "y66","","","","","","","","","","","","")
df1 <- data.frame(ID,out_visit,in_visit,c1,c2,c3,c4)
#final dataset
ID <- c("1","1","1","1","1","1","1","2","2","2","2","2","2","2","2","2","2","2",
        "2","3","4","6")
ID_2 <- c("1","1","1","1","1","1","1","2","2","2","2","2","2","2","3","3","3",
          "3","4","5","6","7")
date <- c("2021-03-17","2021-03-17","2021-03-17","2021-03-17","2021-03-17",
          "2021-03-17","2021-03-17","2021-03-01","2021-03-01","2021-03-01",
          "2021-03-01","2021-03-01","2021-03-01","2021-03-01","2020-11-02",
          "2020-11-02","2020-11-02","2020-11-02","2020-09-12","2017-08-03",
          "2021-03-17","2021-02-02")
code <- c("d55","c33","e22","r44","t55","y55","y66","d55","c33","e22","r44",
          "t55","y55","y66","r44","t55","y55","y66","","","","")

df2 <- data.frame(ID,ID_2,date,code)

Upvotes: 0

Views: 59

Answers (1)

Ben
Ben

Reputation: 30474

I'm not entirely sure this captures what you are looking for, but hope this may be helpful. I wrote this out using additional steps initially for clarity.

I separated your in_visit information from out_visit with its associated codes (c1-c4); now they are in two data.frames.

The in_visit data.frame can have a date range (in_visit date and the date - 365 days).

The out_visit data.frame is put into long form with pivot_longer.

Then, using fuzzy_left_join from the fuzzyjoin package, you can join the two data.frames based on the ID and out_visit dates that fit within your in_visit date range.

A separate id_2 was created based on unique out_visit dates for a given ID. It was unclear if multiple out_visit could exist for a given ID.

library(tidyverse)
library(fuzzyjoin)
library(data.table)

in_df1 <- df1 %>%
  filter(in_visit != "") %>%
  mutate(in_visit = as.Date(in_visit)) %>%
  mutate(in_visit_365 = in_visit - 365) %>%
  select(ID, in_visit, in_visit_365)
  
out_df1 <- df1 %>%
  filter(out_visit != "") %>%
  select(-in_visit) %>%
  pivot_longer(cols = starts_with("c")) %>%
  filter(value != "")

fuzzy_left_join(
  in_df1,
  out_df1,
  by = c("ID" = "ID", "in_visit" = "out_visit", "in_visit_365" = "out_visit"),
  match_fun = c(`==`, `>=`, `<=`)
) %>%
  filter(!is.na(value)) %>%
  mutate(id_2 = rleid(ID.x, in_visit)) %>%
  select(ID.x, id_2, in_visit, value)

Output

   ID.x id_2   in_visit value
1     1    1 2021-03-17   d55
2     1    1 2021-03-17   c33
3     1    1 2021-03-17   e22
4     1    1 2021-03-17   r44
5     1    1 2021-03-17   t55
6     1    1 2021-03-17   y55
7     1    1 2021-03-17   y66
8     2    2 2021-03-01   d55
9     2    2 2021-03-01   c33
10    2    2 2021-03-01   e22
11    2    2 2021-03-01   r44
12    2    2 2021-03-01   t55
13    2    2 2021-03-01   y55
14    2    2 2021-03-01   y66
15    2    3 2020-11-02   r44
16    2    3 2020-11-02   t55
17    2    3 2020-11-02   y55
18    2    3 2020-11-02   y66

Edit: Here is a data.table alternative that might work faster. This would substitute for the fuzzy_left_join.

setDT(in_df1, ID)
setDT(out_df1, ID)

out_df1$out_visit <- as.Date(out_df1$out_visit)

out_df1[in_df1, 
        .(ID, out_visit, value), 
        on = .(ID, out_visit <= in_visit,  out_visit >= in_visit_365), 
        nomatch = 0L][
  , ID_2 := rleid(ID, out_visit)][]

Output

    ID  out_visit value ID_2
 1:  1 2021-03-17   d55    1
 2:  1 2021-03-17   c33    1
 3:  1 2021-03-17   e22    1
 4:  1 2021-03-17   r44    1
 5:  1 2021-03-17   t55    1
 6:  1 2021-03-17   y55    1
 7:  1 2021-03-17   y66    1
 8:  2 2021-03-01   d55    2
 9:  2 2021-03-01   c33    2
10:  2 2021-03-01   e22    2
11:  2 2021-03-01   r44    2
12:  2 2021-03-01   t55    2
13:  2 2021-03-01   y55    2
14:  2 2021-03-01   y66    2
15:  2 2020-11-02   r44    3
16:  2 2020-11-02   t55    3
17:  2 2020-11-02   y55    3
18:  2 2020-11-02   y66    3

Upvotes: 1

Related Questions