Reputation: 33
My question is very simple. I have a data frame with more than 100 columns of numbers. The first column is always a nonzero number. What I want to do is replace each nonzero number in each row (excluding the first column) with the value in the first column of that row.
I was thinking along the lines of an ifelse and a for loop that iterates through the rows, but there must be a simpler, vectorised way to do it...
Upvotes: 0
Views: 1833
Reputation: 7455
Another approach is to use sapply, which is more efficient than looping over the rows. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function to every column of df except the first. In the function, x is each of these columns in turn. The function first finds the row indices of the nonzero entries of x using which, then sets those entries of x to the corresponding values in the rows of the first column of df.
Note that the operations in the function are all "vectorized" over the column; that is, there is no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
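As a minimal sketch of the steps just described, here is the same call on a small made-up data frame. The name df follows the answer; the column names and values are purely illustrative assumptions.

```r
# Hypothetical 3-row data frame; the first column holds the replacement values.
df <- data.frame(val = c(10, 20, 30),
                 a   = c(1, 0, 5),
                 b   = c(0, 2, 0))

df[, -1] <- sapply(df[, -1], function(x) {
  ind <- which(x != 0)    # row indices of the nonzero entries in this column
  x[ind] <- df[ind, 1]    # overwrite them with the first-column value of the same row
  x                       # return the processed column
})

print(df)
#   val  a  b
# 1  10 10  0
# 2  20  0 20
# 3  30 30  0
```

One caveat: if only a single column remains after dropping the first, df[, -1] collapses to a plain vector and sapply would then iterate over its elements, so df[-1] is the safer way to select the columns in that case.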
See this for an excellent review of the *apply family of functions.
Hope this helps.
Upvotes: 1
Reputation: 73405
Suppose your data frame is dat. I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
#   val a b c d e
# 1   1 1 0 0 1 1
# 2   2 0 1 0 0 1
# 3   3 0 1 0 1 0
# 4   4 1 1 1 1 1
# 5   5 1 1 0 0 0
My code above gives:
#   val a b c d e
# 1   1 1 0 0 1 1
# 2   2 0 2 0 0 2
# 3   3 0 3 0 3 0
# 4   4 4 4 4 4 4
# 5   5 5 5 0 0 0
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({
  mat <- as.matrix(dat[, -1])
  pos <- which(mat != 0)
  mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
  "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
})
#    user  system elapsed
#   0.352   0.056   0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
#    user  system elapsed
#   7.804   0.108   7.919
My solution is roughly 19 times faster on this test (0.410 s versus 7.919 s elapsed)!
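A timing comparison is only meaningful if both approaches return the same values. A minimal check, assuming a smaller test matrix than the one in the benchmark (the size n = 50 and the names res_vec and res_sapply are my own choices for illustration):

```r
set.seed(0)
n <- 50
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
                    c("val", paste0("x", 1:n)))

# fully vectorized version
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
res_vec <- mat

# sapply version, stored separately so that `dat` is left untouched
res_sapply <- sapply(dat[, -1], function(x) {
  ind <- which(x != 0)
  x[ind] <- dat[ind, 1]
  x
})

# compare values only; the two results may differ in attributes such as dimnames
all(res_vec == res_sapply)
```

Storing the sapply result in its own object also removes the need to run the two solutions in a particular order.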
Upvotes: 1
Reputation: 2077
Since your data is not that big, I suggest you use a simple loop:
for (i in seq_len(nrow(mydata)))
{
  for (j in 2:ncol(mydata))
  {
    # keep zeros; replace any nonzero entry with the first value of its row
    mydata[i,j] <- ifelse(mydata[i,j]==0, 0, mydata[i,1])
  }
}
Upvotes: 1