PeterD

Reputation: 439

How to replace consecutive zeros in a data frame row with NA without using a for loop?

Given the data frame below:

df <- data.frame(v1 = c(3, 0, 5, 1, 0),
                 v2 = c(2, 0, 0, 0, 0),
                 v3 = c(0, 0, 3, 0, 0),
                 v4 = c(0, 0, 0, 2, 0),
                 v5 = c(0, 0, 0, 0, 0),
                 v6 = c(0, 0, 0, 0, 7))
df
  v1 v2 v3 v4 v5 v6
1  3  2  0  0  0  0
2  0  0  0  0  0  0
3  5  0  3  0  0  0
4  1  0  0  2  0  0
5  0  0  0  0  0  7

The desired result is the following data frame:

  v1 v2 v3 v4 v5 v6
1  3  2 NA NA NA NA
2 NA NA NA NA NA NA
3  5  0  3 NA NA NA
4  1  0  0  2 NA NA
5  0  0  0  0  0  7 

In each row, I'd like to replace every zero that has no non-zero value to its right (that is, the row's trailing zeros) with NA. Zeros that are followed by a non-zero value later in the same row should stay as they are.

I've written a for loop to achieve this result, but this is really slow for a larger data frame:

for(i in 1:nrow(df)) {
  for (j in 1:ncol(df)){
    if ((df[i,j] == 0) & (apply(df[j:ncol(df)], 1, sum)[i] == 0)){
      df[i,j] <- NA
    }
  }
}     

I'd like a more efficient solution.

Upvotes: 2

Views: 259

Answers (3)

Dominic van Essen

Reputation: 872

This solution uses apply to loop over the rows but relies only on vectorized functions within each row, so it effectively removes the inner, per-column loop. Note that it returns a matrix rather than a data frame.

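trailing_zeros_to_NA=function(df){
    l=ncol(df)
    # apply() returns the processed rows as columns, hence the t()
    t(apply(df,1,function(row) {
        # reverse the row so that trailing zeros become leading zeros
        row=rev(row)
        # count the zeros before the first non-zero value (all l if the row is all zeros)
        row[seq_len(c(which(row!=0)-1,l)[1])]=NA
        rev(row)
    }))
}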
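To see why the `c(which(row!=0)-1,l)[1]` expression works, here is a minimal worked sketch on single rows. The vectors and variable names (`r`, `n_trailing`, `zeros`, `n_all`) are only illustrative and are not part of the function above.

```r
# Illustrative row: the trailing zeros start after the 3
row <- c(5, 0, 3, 0, 0, 0)
r <- rev(row)                                   # 0 0 0 3 0 5

# Number of leading zeros in the reversed row; the appended length(r)
# is the fallback used when which() finds no non-zero value at all
n_trailing <- c(which(r != 0) - 1, length(r))[1]

r[seq_len(n_trailing)] <- NA
result <- rev(r)                                # 5 0 3 NA NA NA

# All-zero row: which() returns integer(0), so the fallback is picked
zeros <- c(0, 0, 0)
n_all <- c(which(zeros != 0) - 1, length(zeros))[1]
```

Because `seq_len(0)` is empty, a row that ends in a non-zero value is left untouched.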

The speed-up is about 20x on the example data (compare the median timings below).

original=function(df){
    for(i in 1:nrow(df)) {
        for (j in 1:ncol(df)){
            if ((df[i,j] == 0) & (apply(df[j:ncol(df)], 1, sum)[i] == 0)){
                df[i,j] <- NA
            }
        }
    }
    df
}
library(microbenchmark)
microbenchmark(original(df),trailing_zeros_to_NA(df))
# Unit: microseconds
#                      expr      min        lq      mean    median       uq       max neval
#              original(df) 4672.029 5003.6895 5857.7659 5422.5345 6318.148 10676.192   100
#  trailing_zeros_to_NA(df)  218.482  241.5945  312.7013  267.5695  315.370  2545.555   100

Upvotes: 1

Ronak Shah

Reputation: 389135

There should be a more efficient way to do this, but here is one attempt using apply:

t(apply(df, 1, function(x) {
   inds <- suppressWarnings(pmax(max(which(x != 0)), 0))
   if(inds < length(x))  x[(inds + 1):length(x)] <- NA
   x
}))
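The `suppressWarnings(pmax(...))` wrapper is there because `max()` of an empty index vector returns `-Inf` with a warning when a row has no non-zero value. As a rough sketch, the same index can be computed without triggering the warning; `last_nonzero` is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: index of the last non-zero element, or 0 if there is none
last_nonzero <- function(x) {
  idx <- which(x != 0)
  if (length(idx) > 0) max(idx) else 0L
}

a <- last_nonzero(c(5, 0, 3, 0, 0, 0))  # last non-zero value is at position 3
b <- last_nonzero(c(0, 0, 0))           # all zeros, so the helper returns 0
```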

Another solution could be:

# Get the last non-zero column index for each row using `max.col`
# and cbind it with the row index
inds <- cbind(1:nrow(df), max.col(df != 0, ties.method = 'last'))
# Drop rows whose last column is non-zero (nothing to replace there);
# drop = FALSE keeps a matrix even if only one row remains
inds <- inds[!(inds[, 2] == ncol(df) & rowSums(df != 0) > 0), , drop = FALSE]
# Remaining rows that still point at the last column are all-zero rows
inds[inds[, 2] == ncol(df), 2] <- 0
# For each row, build (row, column) pairs from the column after the last
# non-zero value up to the last column, and set those cells to NA
df[do.call(rbind, Map(function(x, y) 
          cbind(x, y:ncol(df)), inds[, 1], inds[, 2] + 1))] <- NA
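The same idea (find the last non-zero column per row, then blank everything to its right) can probably be written without `Map` by comparing the column indices from `col()` against the per-row index. This is only a sketch under the assumption that the data frame is all numeric and has no NA values to begin with; `m`, `nonzero` and `last` are illustrative names.

```r
df <- data.frame(v1 = c(3, 0, 5, 1, 0),
                 v2 = c(2, 0, 0, 0, 0),
                 v3 = c(0, 0, 3, 0, 0),
                 v4 = c(0, 0, 0, 2, 0),
                 v5 = c(0, 0, 0, 0, 0),
                 v6 = c(0, 0, 0, 0, 7))

m <- as.matrix(df)
nonzero <- m != 0

# Column index of the last non-zero value in each row
last <- max.col(nonzero, ties.method = "last")
# Rows without any non-zero value: treat the last non-zero column as 0
last[rowSums(nonzero) == 0] <- 0

# col(m) holds each cell's column index; `last` is recycled down the rows,
# so the comparison is TRUE for cells to the right of the last non-zero value
df[col(m) > last] <- NA
```

Indexing the data frame with a logical matrix keeps the result a data frame, so no `t()` or conversion back from a matrix is needed.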

Upvotes: 2

Simone Bianchi

Reputation: 143

Same here: there must be a more elegant way, but you could use data.table and evaluate an expression for each individual row by grouping on a unique ID.

require(data.table)
dt <- data.table(ID = seq(1,5,1), # add unique ID
                 v1 = c(3, 0, 5, 1, 0),
                 v2 = c(2, 0, 0, 0, 0),
                 v3 = c(0, 0, 3, 0, 0),
                 v4 = c(0, 0, 0, 2, 0),
                 v5 = c(0, 0, 0, 0, 0),
                 v6 = c(0, 0, 0, 0, 7))

dt[, v1:= ifelse(v1==0 & sum(v2,v3,v4,v5,v6)==0,NA,v1),by=ID]
dt[, v2:= ifelse(v2==0 & sum(v3,v4,v5,v6)==0,   NA,v2),by=ID]
dt[, v3:= ifelse(v3==0 & sum(v4,v5,v6)==0,      NA,v3),by=ID]
dt[, v4:= ifelse(v4==0 & sum(v5,v6)==0,         NA,v4),by=ID]
dt[, v5:= ifelse(v5==0 & sum(v6)==0,            NA,v5),by=ID]
dt[, v6:= ifelse(v6==0 ,                        NA,v6),by=ID] # last column: test v6 itself, not v5
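The column-by-column logic above can be generalized so that the column names are not hard-coded. The following is a base R sketch rather than data.table code: it walks the columns from right to left and keeps a running flag per row (`tail_zero`, an illustrative name) that stays TRUE only while every value seen so far is zero. It assumes numeric columns with no NA values in the input.

```r
df <- data.frame(v1 = c(3, 0, 5, 1, 0),
                 v2 = c(2, 0, 0, 0, 0),
                 v3 = c(0, 0, 3, 0, 0),
                 v4 = c(0, 0, 0, 2, 0),
                 v5 = c(0, 0, 0, 0, 0),
                 v6 = c(0, 0, 0, 0, 7))

# TRUE for rows where all columns visited so far (from the right) are zero
tail_zero <- rep(TRUE, nrow(df))

for (j in rev(seq_along(df))) {
  # A row stays flagged only if this column is zero as well
  tail_zero <- tail_zero & df[[j]] == 0
  df[[j]][tail_zero] <- NA
}
```

This still uses a loop, but it iterates over columns and is vectorized over rows, so the cost should grow with the number of columns rather than the number of cells.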

Upvotes: 0
