Reputation: 19
I have a data.frame
in R, which consists of about 100k rows. I am using the following example data.frame
to illustrate my problem. Imagine here that ID
represents a patient, date
is the date of testing for a disease, A
indicates a positive test from test-type A, and B
is a similar indicator but from test-type B. Every row will have at least one positive test and the dataset is sorted by ID
and date
.
data <- data.frame(ID=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
date=as.Date(c("2022-01-01","2022-01-05","2022-02-17","2022-05-13",
"2022-09-20","2021-02-02","2021-02-06","2021-04-18",
"2021-04-19","2022-08-21","2020-01-01","2020-03-29",
"2020-04-03","2020-04-04","2022-01-01")),
A=c(1,1,0,1,0,0,0,1,0,1,0,0,0,0,0),
B=c(0,0,1,0,1,1,1,0,1,0,1,1,1,1,1))
data
ID date A B
1 1 2022-01-01 1 0
2 1 2022-01-05 1 0
3 1 2022-02-17 0 1
4 1 2022-05-13 1 0
5 1 2022-09-20 0 1
6 2 2021-02-02 0 1
7 2 2021-02-06 0 1
8 2 2021-04-18 1 0
9 2 2021-04-19 0 1
10 2 2022-08-21 1 0
11 3 2020-01-01 0 1
12 3 2020-03-29 0 1
13 3 2020-04-03 0 1
14 3 2020-04-04 0 1
15 3 2022-01-01 0 1
I have the following rule to determine dates/counts (infec_date
and n_infec
) of consecutive infections for a patient:
date
value for a patient will be their initial infection (infec_date
=date
and n_infec=1
).A==1
and date
is 45 days or more after a previous infection, that is a new infection (add 1 to n_infec
and assign date
as the new infec_date
).B==1
and date
is 90 days or more after a previous infection, that is also a new infection (take the same actions as in 2. above).infec_date
/n_infec
are carried forward.My output dataset using this rule will need to look like:
ID date A B infec_date n_infec
1 1 2022-01-01 1 0 2022-01-01 1
2 1 2022-01-05 1 0 2022-01-01 1
3 1 2022-02-17 0 1 2022-01-01 1
4 1 2022-05-13 1 0 2022-05-13 2
5 1 2022-09-20 0 1 2022-09-20 3
6 2 2021-02-02 0 1 2021-02-02 1
7 2 2021-02-06 0 1 2021-02-02 1
8 2 2021-04-18 1 0 2021-04-18 2
9 2 2021-04-19 0 1 2021-04-18 2
10 2 2022-08-21 1 0 2022-08-21 3
11 3 2020-01-01 0 1 2020-01-01 1
12 3 2020-03-29 0 1 2020-01-01 1
13 3 2020-04-03 0 1 2020-04-03 2
14 3 2020-04-04 0 1 2020-04-03 2
15 3 2022-01-01 0 1 2022-01-01 3
MY GOAL: Is to find a way to do this that is faster than having to iterate through each row and check a string of logic to set values manually. I would like to know if there is any functionality for ease/speed of this type of programming using packages dplyr
, tidyr
, data.table
, or sqldf
.
My current for
loop I've been using in R is:
for(i in 1:nrow(data)){
if(i==1){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=1
}else if(data[i,"ID"]!=data[i-1,"ID"]){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=1
}else{
if(data[i,"A"]==1&data[i,"date"]>=data[i-1,"infec_date"]+45){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=data[i-1,"n_infec"]+1
}else if(data[i,"B"]==1&data[i,"date"]>=(data[i-1,"infec_date"]+90)){
data[i,"infec_date"]=data[i,"date"]
data[i,"n_infec"]=data[i-1,"n_infec"]+1
}else{
data[i,"infec_date"]=data[i-1,"infec_date"]
data[i,"n_infec"]=data[i-1,"n_infec"]
}
}
}
This gets slow and takes forever to run when dealing with 100k rows of data. I don't have access to SAS, but programming this in SAS would look like:
data new_data;
set data;
by id date;
length infec_date n_infec 8.;
format infec_date mmddyy10.;
retain infec_date n_infec;
if first.id then do;
infec_date=date;
n_infec=1;
end;
if A=1 and date>=infec_date+45 then do;
infec_date=date;
n_infec=n_infec+1;
end;
else if B=1 and date>=infec_date+90 then do;
infec_date=date;
n_infec=n_infec+1;
end;
run;
Thanks in advance!
Upvotes: 1
Views: 290
Reputation: 922
Thanks for the additional information about data, helpful to know more of the limitations. Below answer is still essentially looping through each row of each participant, but it is a bit more optimized than the for-loop/nested if-else you have listed in the question. I optimized by
Splitting the dataframe up into lists by participant; iterating through lists is faster and this removes the need to track if participant session is "first"/duplicate
Vectorized and preallocated a column that the new date would go into rather than adding to and expanding with each iteration
Used ifelse in a user-defined function that took vector inputs, this required some data modification of making a "group" that contained whether the date was either an A or a B.
More ways to optimize here
For the code, I first pivoted longer to make A and B be in the same column and dropped the rows where, eg., B = 0, as these weren't relevant
data <- data %>%
pivot_longer(cols = c("A","B")) %>%
filter(value != 0) %>%
rename("group" = "name") #renamed to group for ease in explanation
#Select only variables of interest
df_list <- data %>%
select(ID, date,group)
#Make this into list
df_list <- split(df_list,df_list$ID)
Next wrote a user-defined function that will take in objects of the date, group, and running_infec. These will be pieces of information available in the loop run below.
condition_code <- function(date,group,running_infec){
if(group == "A"){
ifelse(date - running_infec < 45, running_infec, date)
}else{
ifelse(date - running_infec < 90, running_infec, date)
}
}
Next is to loop through each list. For each list, create or predefine the new date column as the first date listed. This takes care of when it is a new participant as there is only one participant per list.
For each row of the list, objects are saved and used in the function condition_code
# Loop through each list in your series of lists
for (i in seq_along(df_list)) {
# Get the data frame within the current list
df <- df_list[[i]]
# Create a new column to store the coded dates
df$infec_date <- df$date[1]
# Creates initial
running_infec <- df$date[1]
# Loop through each row of the data frame
for (j in 1:nrow(df)) {
# Get the date and group for the current row
date <- df$date[j]
group <- df$group[j]
#Assigns value to infec_date
infec_date <- as.Date(condition_code(date, group, running_infec), origin = "1970-01-01")
# Save the new date in the new column of the data frame
df$infec_date[j] <- infec_date
#update running_infec value
running_infec <- infec_date
}
# Replace the original data frame in the current list with the updated data frame
df_list[[i]] <- df
}
After loop, bind the list back together for your dataframe, group by ID, and perform similar count of infection/new infection as shown with first answer. Adding the pivot_wider back in to get data output to match what you have shown in picture above but that may not be necessary.
#Get running count, pivot wider back to requested output
df_list %>%
bind_rows() %>%
group_by(ID) %>%
mutate(running_count = case_when(infec_date == date ~ 1,
TRUE ~ 0)) %>%
mutate(running_count = cumsum(running_count)) %>%
pivot_wider(names_from = group, values_from = group, values_fn = ~1, values_fill = 0) %>%
select(ID, date, A, B, infec_date, running_count)
[PREVIOUS ANSWER BUT OP COMMENT CLARIFIED THAT BELOW SOLUTION INCOMPLETE--IGNORE BELOW]
data %>%
group_by(ID) %>%
mutate(dup = duplicated(ID)) %>%
mutate(infection = case_when(A == 1 & (date - lag(date)) >=45 ~ 1,
B == 1 & (date - lag(date)) >=90 ~ 1,
dup == FALSE ~ 1,
TRUE ~ 0)) %>%
mutate(infec_date = case_when(infection == 1 ~ date,
infection == 0 & lag(infection) == 1 ~ lag(date),
infection == 0 & lag(infection) == 0 ~ lag(date,n=2),
TRUE ~ NA)) %>%
mutate(n_infec = cumsum(infection)) %>%
select(-dup,-infection) %>%
ungroup()
Upvotes: 0