Reputation: 439
Given the data frame below:
df <- data.frame(v1 = c(3, 0, 5, 1, 0),
v2 = c(2, 0, 0, 0, 0),
v3 = c(0, 0, 3, 0, 0),
v4 = c(0, 0, 0, 2, 0),
v5 = c(0, 0, 0, 0, 0),
v6 = c(0, 0, 0, 0, 7))
df
v1 v2 v3 v4 v5 v6
1 3 2 0 0 0 0
2 0 0 0 0 0 0
3 5 0 3 0 0 0
4 1 0 0 2 0 0
5 0 0 0 0 0 7
The desired result is the following data frame:
v1 v2 v3 v4 v5 v6
1 3 2 NA NA NA NA
2 NA NA NA NA NA NA
3 5 0 3 NA NA NA
4 1 0 0 2 NA NA
5 0 0 0 0 0 7
I'd like to replace the trailing zeros in each row with NA:
reading a row from left to right, a zero should become NA only if no non-zero value appears anywhere to its right.
I've written a for loop to achieve this result, but this is really slow for a larger data frame:
for(i in 1:nrow(df)) {
for (j in 1:ncol(df)){
if ((df[i,j] == 0) & (apply(df[j:ncol(df)], 1, sum)[i] == 0)){
df[i,j] <- NA
}
}
}
I'd like a more efficient solution.
Upvotes: 2
Views: 259
Reputation: 872
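This solution uses apply to loop over the rows, but only vectorized functions within each row, so it effectively removes one non-vectorized loop. Note that t(apply(...)) returns a matrix; wrap the result in as.data.frame() if a data frame is needed.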
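trailing_zeros_to_NA <- function(df) {
  l <- ncol(df)
  t(apply(df, 1, function(row) {
    # Reverse the row so that trailing zeros become leading zeros
    row <- rev(row)
    # Count the zeros before the first non-zero value; use l if the row is all zeros
    row[seq_len(c(which(row != 0) - 1, l)[1])] <- NA
    # Restore the original column order
    rev(row)
  }))
}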
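To make the indexing trick easier to follow, here is a sketch of the same steps applied to a single row (row 3 of the question's data). The variable names `r`, `first_nonzero` and `n_trailing` are introduced only for this illustration.

```r
# Row 3 of the example data: 5 0 3 0 0 0
row <- c(v1 = 5, v2 = 0, v3 = 3, v4 = 0, v5 = 0, v6 = 0)
l <- length(row)

r <- rev(row)                             # 0 0 0 3 0 5: trailing zeros now lead
first_nonzero <- which(r != 0)            # positions of non-zero values in the reversed row
n_trailing <- c(first_nonzero - 1, l)[1]  # zeros before the first non-zero; l if none exists
r[seq_len(n_trailing)] <- NA              # blank out exactly those leading positions
result <- rev(r)                          # back to the original column order
result
```

Appending `l` to the candidate list is what handles an all-zero row: `which()` then returns nothing, so the first element is `l` and the whole row is replaced.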
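On the 5-row example data, trailing_zeros_to_NA is about 20x faster than the original loop (median of roughly 0.27 ms versus 5.4 ms):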
original=function(df){
for(i in 1:nrow(df)) {
for (j in 1:ncol(df)){
if ((df[i,j] == 0) & (apply(df[j:ncol(df)], 1, sum)[i] == 0)){
df[i,j] <- NA
}
}
}
df
}
library(microbenchmark)
microbenchmark(original(df),trailing_zeros_to_NA(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# original(df) 4672.029 5003.6895 5857.7659 5422.5345 6318.148 10676.192 100
# trailing_zeros_to_NA(df) 218.482 241.5945 312.7013 267.5695 315.370 2545.555 100
Upvotes: 1
Reputation: 389135
There should be a more efficient way to do this, but here is one attempt using apply:
t(apply(df, 1, function(x) {
inds <- suppressWarnings(pmax(max(which(x != 0)), 0))
if(inds < length(x)) x[(inds + 1):length(x)] <- NA
x
}))
Another solution could be:
# Get the last non-zero column index for each row using `max.col`
# and cbind it with the row index
inds <- cbind(1:nrow(df), max.col(df != 0, ties.method = 'last'))
# Drop rows whose last column is non-zero (nothing to replace there)
inds <- inds[!(inds[, 2] == ncol(df) & rowSums(df != 0) > 0), , drop = FALSE]
# All-zero rows also report the last column; set them to 0 so the whole row is replaced
inds[inds[, 2] == ncol(df), 2] <- 0
# Build (row, column) pairs from the column after the last non-zero value
# to the last column of each row, and replace those cells with NA
df[do.call(rbind, Map(function(x, y)
cbind(x, y:ncol(df)), inds[, 1], inds[, 2] + 1))] <- NA
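The same idea (find the last non-zero column per row, then blank everything to its right) can also be sketched without building index pairs, by comparing `col()` against the per-row cut-off. This is a variation on the code above rather than the answer's exact method; `nonzero`, `last` and `res` are names chosen for this sketch.

```r
df <- data.frame(v1 = c(3, 0, 5, 1, 0),
                 v2 = c(2, 0, 0, 0, 0),
                 v3 = c(0, 0, 3, 0, 0),
                 v4 = c(0, 0, 0, 2, 0),
                 v5 = c(0, 0, 0, 0, 0),
                 v6 = c(0, 0, 0, 0, 7))

nonzero <- df != 0                              # logical matrix, TRUE where a value is non-zero
last <- max.col(nonzero, ties.method = "last")  # last non-zero column index per row
last[rowSums(nonzero) == 0] <- 0                # all-zero rows: nothing to keep
res <- df
res[col(nonzero) > last] <- NA                  # mask every cell to the right of `last`
res
```

Because `last` has one entry per row, it is recycled down the columns of `col(nonzero)`, so each cell is compared with its own row's cut-off.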
Upvotes: 2
Reputation: 143
Same here, there must be a more elegant way, but you could use data.table and run a function for each individual row.
require(data.table)
dt <- data.table(ID = seq(1,5,1), # add unique ID
v1 = c(3, 0, 5, 1, 0),
v2 = c(2, 0, 0, 0, 0),
v3 = c(0, 0, 3, 0, 0),
v4 = c(0, 0, 0, 2, 0),
v5 = c(0, 0, 0, 0, 0),
v6 = c(0, 0, 0, 0, 7))
dt[, v1:= ifelse(v1==0 & sum(v2,v3,v4,v5,v6)==0,NA,v1),by=ID]
dt[, v2:= ifelse(v2==0 & sum(v3,v4,v5,v6)==0, NA,v2),by=ID]
dt[, v3:= ifelse(v3==0 & sum(v4,v5,v6)==0, NA,v3),by=ID]
dt[, v4:= ifelse(v4==0 & sum(v5,v6)==0, NA,v4),by=ID]
dt[, v5:= ifelse(v5==0 & sum(v6)==0, NA,v5),by=ID]
dt[, v6:= ifelse(v6==0, NA,v6),by=ID]
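The per-column lines above follow one pattern: a column becomes NA where it is zero and everything to its right is zero. As a rough sketch, that pattern can be written as a single right-to-left pass that carries a running flag. The version below uses a plain base R data frame instead of data.table to stay self-contained, and `tail_is_zero` is a name made up for this illustration.

```r
df <- data.frame(v1 = c(3, 0, 5, 1, 0),
                 v2 = c(2, 0, 0, 0, 0),
                 v3 = c(0, 0, 3, 0, 0),
                 v4 = c(0, 0, 0, 2, 0),
                 v5 = c(0, 0, 0, 0, 0),
                 v6 = c(0, 0, 0, 0, 7))

# TRUE for a row while every column visited so far (from the right) is zero
tail_is_zero <- rep(TRUE, nrow(df))
for (j in rev(names(df))) {
  # Stay TRUE only if this column is zero as well
  tail_is_zero <- tail_is_zero & df[[j]] == 0
  # Rows still flagged have only zeros from column j to the end
  df[[j]][tail_is_zero] <- NA
}
df
```

This needs one vectorized step per column and no ID column, so it does not depend on the number or names of the columns.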
Upvotes: 0