Leonardo Bhering

Reputation: 21

Mean imputation in a matrix in R

I have a matrix in R with 440 rows and 261 columns. Some of the values are 0. In each row, I need to replace the 0 values with the mean of all the values in that row. I tried the code below, but every time it used only the first mean value.

snp2<- read.table("snp2.txt",h=T)    
mean <- rowMeans(snp2)    
for(k in 1:nrow(snp2))    
{    
snp2[k==0]<-mean[k]  
}    

Upvotes: 2

Views: 265

Answers (2)

josliber

Reputation: 44309

Instead of looping through the rows, you could do this in one shot by identifying all the 0 indices in the matrix and replacing them with the appropriate row mean:

# Sample data
(mat <- matrix(c(0, 1, 2, 1, 0, 3, 11, 11, 11), nrow=3))
#      [,1] [,2] [,3]
# [1,]    0    1   11
# [2,]    1    0   11
# [3,]    2    3   11
(zeroes <- which(mat == 0, arr.ind=TRUE))
#      row col
# [1,]   1   1
# [2,]   2   2
mat[zeroes] <- rowMeans(mat)[zeroes[,"row"]]
mat
#      [,1] [,2] [,3]
# [1,]    4    1   11
# [2,]    1    4   11
# [3,]    2    3   11
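
Since read.table() returns a data frame rather than a matrix, here is a minimal sketch of applying the same idea to the data from the question. It assumes every column is numeric; the small data frame and its column names (m1, m2, m3) are made-up stand-ins for the contents of snp2.txt, and snp_mat and snp2_imputed are illustrative names.

```r
# Hypothetical stand-in for: snp2 <- read.table("snp2.txt", header = TRUE)
snp2 <- data.frame(m1 = c(0, 1, 2), m2 = c(1, 0, 3), m3 = c(11, 11, 11))

# Work on a numeric matrix so that matrix indexing behaves as in the example
snp_mat <- as.matrix(snp2)

# (row, col) positions of every zero
zeroes <- which(snp_mat == 0, arr.ind = TRUE)

# Row means are computed before the assignment, so zeroes count toward the mean
snp_mat[zeroes] <- rowMeans(snp_mat)[zeroes[, "row"]]

# Convert back if a data frame is needed downstream
snp2_imputed <- as.data.frame(snp_mat)
```

If some columns are not numeric (for example an ID column), as.matrix() would produce a character matrix, so those columns would need to be set aside first.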

While you could fix up your loop to replace the zeroes row by row, that will not be as efficient as the one-shot approach (in addition to being more typing):

josilber <- function(mat) {
  zeroes <- which(mat == 0, arr.ind=TRUE)
  mat[zeroes] <- rowMeans(mat)[zeroes[,"row"]]
  mat
}
OP.fixed <- function(mat) {
  means <- rowMeans(mat)    
  for(k in 1:nrow(mat)) {    
    mat[k,][mat[k,] == 0] <- means[k]  
  }
  mat
}
bgoldst <- function(m) ifelse(m==0,rowMeans({ mt <- m; mt[mt==0] <- NA; mt; },na.rm=T)[row(m)],m);
# 4400 x 2610 matrix
bigger <- matrix(sample(0:10, 4400*2610, replace=TRUE), nrow=4400)
all.equal(josilber(bigger), OP.fixed(bigger))
# [1] TRUE
# bgoldst differs because it takes means of non-zero values only
library(microbenchmark)
microbenchmark(josilber(bigger), OP.fixed(bigger), bgoldst(bigger), times=10)
# Unit: milliseconds
#              expr      min        lq      mean    median        uq       max neval
#  josilber(bigger)  262.541  382.0706  406.1107  395.3815  452.0872  532.4742    10
#  OP.fixed(bigger) 1033.071 1184.7288 1236.6245 1238.8298 1271.7677 1606.6737    10
#   bgoldst(bigger) 3820.044 4033.5826 4368.5848 4201.6302 4611.9697 5581.5514    10

For a fairly large matrix (4400 x 2610), the one-shot procedure is about 3 times faster than the fixed-up solution from the question and about 10 times faster than the one proposed by @bgoldst (comparing median timings).
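
For anyone wondering what the row-by-row fix actually changes, here is a small sketch on the sample matrix from above. The key point is that `k == 0` in the question compares the loop counter with zero instead of looking at the data; the variable name row_is_zero is just for illustration.

```r
mat <- matrix(c(0, 1, 2, 1, 0, 3, 11, 11, 11), nrow = 3)
means <- rowMeans(mat)

k <- 1
counter_test <- (k == 0)       # a single FALSE: tests the counter, not the data

# What the loop body needs instead: a logical vector over the cells of row k
row_is_zero <- mat[k, ] == 0

# Replace only the zero cells of row k with that row's mean
mat[k, row_is_zero] <- means[k]
```

Repeating the last two statements for every k in 1:nrow(mat) gives the same result as OP.fixed.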

Upvotes: 1

bgoldst

Reputation: 35314

Here's a solution using ifelse(), assuming you want to exclude zeroes from the mean calculation:

NR <- 5; NC <- 5;
set.seed(1); m <- matrix(sample(c(rep(0,5),1:5),NR*NC,replace=T),NR);
m;
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    4    0    0    5
## [2,]    0    5    0    3    0
## [3,]    1    2    2    5    2
## [4,]    5    2    0    0    0
## [5,]    0    0    3    3    0
ifelse(m==0,rowMeans({ mt <- m; mt[mt==0] <- NA; mt; },na.rm=T)[row(m)],m);
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  4.5    4  4.5  4.5  5.0
## [2,]  4.0    5  4.0  3.0  4.0
## [3,]  1.0    2  2.0  5.0  2.0
## [4,]  5.0    2  3.5  3.5  3.5
## [5,]  3.0    3  3.0  3.0  3.0
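
Whether zeroes should count toward the mean changes the imputed values, so it is worth deciding up front. The sketch below compares both definitions on a small made-up matrix; the names means_incl, means_excl, incl and excl are illustrative, and replace() is used here as an alternative way to mark zeroes as NA.

```r
mat <- matrix(c(0, 1, 2, 1, 0, 3, 11, 11, 11), nrow = 3)
zero_idx <- which(mat == 0, arr.ind = TRUE)

# Row means that count zeroes as real observations
means_incl <- rowMeans(mat)

# Row means that treat zeroes as missing values
means_excl <- rowMeans(replace(mat, mat == 0, NA), na.rm = TRUE)

# Impute under each definition
incl <- mat
incl[zero_idx] <- means_incl[zero_idx[, "row"]]
excl <- mat
excl[zero_idx] <- means_excl[zero_idx[, "row"]]
```

Here the zero in row 1 becomes 4 when zeroes are included in the mean and 6 when they are excluded. Note that under the excluding definition, a row made up entirely of zeroes has no non-missing values, so its mean (and the imputed value) is NaN.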

Upvotes: 0
