Jack Fiskum
Jack Fiskum

Reputation: 19

R: Faster way to create a variable using earlier assigned values (SAS retain-type programming)

I have a data.frame in R, which consists of about 100k rows. I am using the following example data.frame to illustrate my problem. Imagine here that ID represents a patient, date is the date of testing for a disease, A indicates a positive test from test-type A, and B is a similar indicator but from test-type B. Every row will have at least one positive test and the dataset is sorted by ID and date.

data <- data.frame(ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                   date=as.Date(c("2022-01-01","2022-01-05","2022-02-17","2022-05-13",
                                "2022-09-20","2021-02-02","2021-02-06","2021-04-18",
                                "2021-04-19","2022-08-21","2020-01-01","2020-03-29",
                                "2020-04-03","2020-04-04","2022-01-01")),
                   A=c(1,1,0,1,0,0,0,1,0,1,0,0,0,0,0),
                   B=c(0,0,1,0,1,1,1,0,1,0,1,1,1,1,1))

data

   ID       date A B
1   1 2022-01-01 1 0
2   1 2022-01-05 1 0
3   1 2022-02-17 0 1
4   1 2022-05-13 1 0
5   1 2022-09-20 0 1
6   2 2021-02-02 0 1
7   2 2021-02-06 0 1
8   2 2021-04-18 1 0
9   2 2021-04-19 0 1
10  2 2022-08-21 1 0
11  3 2020-01-01 0 1
12  3 2020-03-29 0 1
13  3 2020-04-03 0 1
14  3 2020-04-04 0 1
15  3 2022-01-01 0 1

I have the following rule to determine dates/counts (infec_date and n_infec) of consecutive infections for a patient:

  1. The first date value for a patient will be their initial infection (infec_date=date and n_infec=1).
  2. If A==1 and date is 45 days or more after a previous infection, that is a new infection (add 1 to n_infec and assign date as the new infec_date).
  3. Otherwise, if B==1 and date is 90 days or more after a previous infection, that is also a new infection (take the same actions as in 2. above).
  4. If no criteria for infection are met, then the most recent infec_date/n_infec are carried forward.

My output dataset using this rule will need to look like:

   ID       date A B infec_date n_infec
1   1 2022-01-01 1 0 2022-01-01       1
2   1 2022-01-05 1 0 2022-01-01       1
3   1 2022-02-17 0 1 2022-01-01       1
4   1 2022-05-13 1 0 2022-05-13       2
5   1 2022-09-20 0 1 2022-09-20       3
6   2 2021-02-02 0 1 2021-02-02       1
7   2 2021-02-06 0 1 2021-02-02       1
8   2 2021-04-18 1 0 2021-04-18       2
9   2 2021-04-19 0 1 2021-04-18       2
10  2 2022-08-21 1 0 2022-08-21       3
11  3 2020-01-01 0 1 2020-01-01       1
12  3 2020-03-29 0 1 2020-01-01       1
13  3 2020-04-03 0 1 2020-04-03       2
14  3 2020-04-04 0 1 2020-04-03       2
15  3 2022-01-01 0 1 2022-01-01       3

MY GOAL: Is to find a way to do this that is faster than having to iterate through each row and check a string of logic to set values manually. I would like to know if there is any functionality for ease/speed of this type of programming using packages dplyr, tidyr, data.table, or sqldf.

My current for loop I've been using in R is:

for(i in 1:nrow(data)){
  if(i==1){
    data[i,"infec_date"]=data[i,"date"]
    data[i,"n_infec"]=1
  }else if(data[i,"ID"]!=data[i-1,"ID"]){
    data[i,"infec_date"]=data[i,"date"]
    data[i,"n_infec"]=1
  }else{
    if(data[i,"A"]==1&data[i,"date"]>=data[i-1,"infec_date"]+45){
      data[i,"infec_date"]=data[i,"date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]+1
    }else if(data[i,"B"]==1&data[i,"date"]>=(data[i-1,"infec_date"]+90)){
      data[i,"infec_date"]=data[i,"date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]+1
    }else{
      data[i,"infec_date"]=data[i-1,"infec_date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]
    }
  }
}

This gets slow and takes forever to run when dealing with 100k rows of data. I don't have access to SAS, but programming this in SAS would look like:

data new_data;
    set data;
    by id date;
    length infec_date n_infec 8.;
    format infec_date mmddyy10.;
    retain infec_date n_infec;
    if first.id then do;
        infec_date=date;
        n_infec=1;
        end;
    if A=1 and date>=infec_date+45 then do;
        infec_date=date;
        n_infec=n_infec+1;
        end;
    else if B=1 and date>=infec_date+90 then do;
        infec_date=date;
        n_infec=n_infec+1;
        end;
run;

Thanks in advance!

Upvotes: 1

Views: 290

Answers (1)

Jonni
Jonni

Reputation: 922

Thanks for the additional information about data, helpful to know more of the limitations. Below answer is still essentially looping through each row of each participant, but it is a bit more optimized than the for-loop/nested if-else you have listed in the question. I optimized by

  1. Splitting the dataframe up into lists by participant; iterating through lists is faster and this removes the need to track if participant session is "first"/duplicate

  2. Vectorized and preallocated a column that the new date would go into rather than adding to and expanding with each iteration

  3. Used ifelse in a user-defined function that took vector inputs, this required some data modification of making a "group" that contained whether the date was either an A or a B.

More ways to optimize here

For the code, I first pivoted longer to make A and B be in the same column and dropped the rows where, eg., B = 0, as these weren't relevant

data <- data %>%  
    pivot_longer(cols = c("A","B")) %>% 
    filter(value != 0) %>% 
    rename("group" = "name") #renamed to group for ease in explanation

#Select only variables of interest
df_list <- data %>% 
    select(ID, date,group)

#Make this into list
df_list <- split(df_list,df_list$ID)

Next wrote a user-defined function that will take in objects of the date, group, and running_infec. These will be pieces of information available in the loop run below.

condition_code <- function(date,group,running_infec){
    if(group == "A"){
        ifelse(date - running_infec < 45, running_infec, date)
    }else{
        ifelse(date - running_infec < 90, running_infec, date)
    }
}

Next is to loop through each list. For each list, create or predefine the new date column as the first date listed. This takes care of when it is a new participant as there is only one participant per list.

For each row of the list, objects are saved and used in the function condition_code

# Loop through each list in your series of lists
for (i in seq_along(df_list)) {
    # Get the data frame within the current list
    df <- df_list[[i]]
    
    # Create a new column to store the coded dates
    df$infec_date <- df$date[1]
    # Creates initial
    running_infec <- df$date[1]
    # Loop through each row of the data frame
    for (j in 1:nrow(df)) {
        # Get the date and group for the current row
        date <- df$date[j]
        group <- df$group[j]
        
        #Assigns value to infec_date
        infec_date <- as.Date(condition_code(date, group, running_infec), origin = "1970-01-01")
        # Save the new date in the new column of the data frame
        df$infec_date[j] <- infec_date
        #update running_infec value
        running_infec <- infec_date
    }
    
    # Replace the original data frame in the current list with the updated data frame
    df_list[[i]] <- df
}

After loop, bind the list back together for your dataframe, group by ID, and perform similar count of infection/new infection as shown with first answer. Adding the pivot_wider back in to get data output to match what you have shown in picture above but that may not be necessary.

#Get running count, pivot wider back to requested output
df_list %>% 
    bind_rows() %>%
    group_by(ID) %>% 
    mutate(running_count = case_when(infec_date == date ~ 1,
                                     TRUE ~ 0)) %>% 
    mutate(running_count = cumsum(running_count)) %>% 
    pivot_wider(names_from = group, values_from = group, values_fn = ~1, values_fill = 0) %>% 
    select(ID, date, A, B, infec_date, running_count)

enter image description here

[PREVIOUS ANSWER BUT OP COMMENT CLARIFIED THAT BELOW SOLUTION INCOMPLETE--IGNORE BELOW]

data %>% 
    group_by(ID) %>% 
    mutate(dup = duplicated(ID)) %>% 
    mutate(infection = case_when(A == 1 & (date - lag(date)) >=45 ~ 1,
                                 B == 1 & (date - lag(date)) >=90 ~ 1,
                                 dup == FALSE ~ 1,
                                 TRUE ~ 0)) %>% 
    mutate(infec_date = case_when(infection == 1 ~ date,
                                      infection == 0 & lag(infection) == 1 ~ lag(date),
                                      infection == 0 & lag(infection) == 0 ~ lag(date,n=2), 
                                      TRUE ~ NA)) %>% 
    mutate(n_infec = cumsum(infection)) %>% 
    select(-dup,-infection) %>% 
    ungroup()

Upvotes: 0

Related Questions