Reed

Reputation: 318

Efficiently replacing a variable number of NA values based on logical vector

I am attempting to replace NA values in my data frame based on the value of a logical (0/1) column in the same data frame.

# Creating random example data (note: cbind produces a matrix here)
a <- rbinom(1000, 1, .5)
b <- rbinom(1000, 1, .75)
c <- rbinom(1000, 1, .25)
d <- rbinom(1000, 1, .5)
e <- rbinom(1000, 1, .5) # will be the logical column
df <- cbind(a, b, c, d)

for (i in 1:1000) {
  # knock out rows where the first four columns sum to more than 2
  if (sum(df[i, 1:4]) > 2) {
    df[i, 1:4] <- NA
  }
}
# Randomly replace some of the NAs to represent the observation data
df[sample(1:length(df), 100, replace = FALSE)] <- 1

df <- cbind(df, e)

I am attempting to fill in the NAs with 0 when e == 1, while still retaining the random 1s I placed in the other 4 columns (especially those where the rest of the values are NA). I've tried creating loops like:

for(i in 1:nrow(df)){
  if(df[,'e']==1){
    df[i,is.na(df[i,1:4])] <- 0 
  }
}

However, that clears both my logical column and my observation data.

The data frame I want to apply this to is large (2.8 million rows × 23 columns) and contains metadata and observation data, so something that takes speed into account would be great.

Upvotes: 1

Views: 98

Answers (1)

akrun

Reputation: 887521

We can do this with data.table:

library(data.table)
df1 <- as.data.frame(df)  # 'df' is a matrix, so convert it before setDT
setDT(df1)
for (j in 1:4) {
  # assign 0 by reference to the NA cells of column j in rows where e == 1
  set(df1, i = which(df1[['e']] == 1 & is.na(df1[[j]])), j = j, value = 0)
}

This should be efficient because set avoids the overhead of [.data.table; see its help page (?set).
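For comparison, here is a minimal sketch of the equivalent := idiom (assuming a data.table version >= 1.12.4, where nafill is available); it performs the same replacement in a single assignment by reference, but goes through [.data.table:

cols <- c("a", "b", "c", "d") # observation columns from the example above
# replace NAs with 0L (integer, matching the rbinom columns) in rows where e == 1
df1[e == 1, (cols) := lapply(.SD, nafill, fill = 0L), .SDcols = cols]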


As @thelatemail mentioned, a compact base R option would be:

df[,1:4][df[,"e"]==1 & is.na(df[,1:4])] <- 0

(The length-n vector df[,"e"]==1 is recycled column-wise against the n × 4 logical matrix is.na(df[,1:4]), so the test lines up row by row.) If the matrix is very big, however, that intermediate logical matrix is just as big, which could create memory issues.
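If memory is a concern, a minimal sketch of a column-at-a-time variant of the same base R approach (the indices 1:4 are the observation columns from the example) avoids materializing the full logical matrix, keeping only one column-length temporary at a time:

rows <- which(df[, "e"] == 1) # rows eligible for replacement
for (j in 1:4) {
  na_rows <- rows[is.na(df[rows, j])] # NA cells in column j among those rows
  df[na_rows, j] <- 0
}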

Upvotes: 1
