R: Faster way to create a variable using earlier assigned values (SAS retain-type programming)

Question

I have a data.frame in R, which consists of about 100k rows. I am using the following example data.frame to illustrate my problem. Imagine here that ID represents a patient, date is the date of testing for a disease, A indicates a positive test from test-type A, and B is a similar indicator but from test-type B. Every row will have at least one positive test and the dataset is sorted by ID and date.

data <- data.frame(ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                   date=as.Date(c("2022-01-01","2022-01-05","2022-02-17","2022-05-13",
                                "2022-09-20","2021-02-02","2021-02-06","2021-04-18",
                                "2021-04-19","2022-08-21","2020-01-01","2020-03-29",
                                "2020-04-03","2020-04-04","2022-01-01")),
                   A=c(1,1,0,1,0,0,0,1,0,1,0,0,0,0,0),
                   B=c(0,0,1,0,1,1,1,0,1,0,1,1,1,1,1))

data

   ID       date A B
1   1 2022-01-01 1 0
2   1 2022-01-05 1 0
3   1 2022-02-17 0 1
4   1 2022-05-13 1 0
5   1 2022-09-20 0 1
6   2 2021-02-02 0 1
7   2 2021-02-06 0 1
8   2 2021-04-18 1 0
9   2 2021-04-19 0 1
10  2 2022-08-21 1 0
11  3 2020-01-01 0 1
12  3 2020-03-29 0 1
13  3 2020-04-03 0 1
14  3 2020-04-04 0 1
15  3 2022-01-01 0 1

I have the following rule to determine dates/counts (infec_date and n_infec) of consecutive infections for a patient:

The first date value for a patient will be their initial infection (infec_date=date and n_infec=1).
If A==1 and date is 45 days or more after a previous infection, that is a new infection (add 1 to n_infec and assign date as the new infec_date).
Otherwise, if B==1 and date is 90 days or more after a previous infection, that is also a new infection (take the same actions as in 2. above).
If no criteria for infection are met, then the most recent infec_date/n_infec are carried forward.

My output dataset using this rule will need to look like:

   ID       date A B infec_date n_infec
1   1 2022-01-01 1 0 2022-01-01       1
2   1 2022-01-05 1 0 2022-01-01       1
3   1 2022-02-17 0 1 2022-01-01       1
4   1 2022-05-13 1 0 2022-05-13       2
5   1 2022-09-20 0 1 2022-09-20       3
6   2 2021-02-02 0 1 2021-02-02       1
7   2 2021-02-06 0 1 2021-02-02       1
8   2 2021-04-18 1 0 2021-04-18       2
9   2 2021-04-19 0 1 2021-04-18       2
10  2 2022-08-21 1 0 2022-08-21       3
11  3 2020-01-01 0 1 2020-01-01       1
12  3 2020-03-29 0 1 2020-01-01       1
13  3 2020-04-03 0 1 2020-04-03       2
14  3 2020-04-04 0 1 2020-04-03       2
15  3 2022-01-01 0 1 2022-01-01       3

MY GOAL: Is to find a way to do this that is faster than having to iterate through each row and check a string of logic to set values manually. I would like to know if there is any functionality for ease/speed of this type of programming using packages dplyr, tidyr, data.table, or sqldf.

My current for loop I've been using in R is:

for(i in 1:nrow(data)){
  if(i==1){
    data[i,"infec_date"]=data[i,"date"]
    data[i,"n_infec"]=1
  }else if(data[i,"ID"]!=data[i-1,"ID"]){
    data[i,"infec_date"]=data[i,"date"]
    data[i,"n_infec"]=1
  }else{
    if(data[i,"A"]==1&data[i,"date"]>=data[i-1,"infec_date"]+45){
      data[i,"infec_date"]=data[i,"date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]+1
    }else if(data[i,"B"]==1&data[i,"date"]>=(data[i-1,"infec_date"]+90)){
      data[i,"infec_date"]=data[i,"date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]+1
    }else{
      data[i,"infec_date"]=data[i-1,"infec_date"]
      data[i,"n_infec"]=data[i-1,"n_infec"]
    }
  }
}

This gets slow and takes forever to run when dealing with 100k rows of data. I don't have access to SAS, but programming this in SAS would look like:

data new_data;
    set data;
    by id date;
    length infec_date n_infec 8.;
    format infec_date mmddyy10.;
    retain infec_date n_infec;
    if first.id then do;
        infec_date=date;
        n_infec=1;
        end;
    if A=1 and date>=infec_date+45 then do;
        infec_date=date;
        n_infec=n_infec+1;
        end;
    else if B=1 and date>=infec_date+90 then do;
        infec_date=date;
        n_infec=n_infec+1;
        end;
run;

Thanks in advance!

R: Faster way to create a variable using earlier assigned values (SAS retain-type programming)

Answers (1)

Related Questions