Evan
Evan

Reputation: 1

Downsample matrix in R?

My question is about how to improve the performance of function that downsamples from the columns of a matrix without replacement (a.k.a. "rarefication" of a matrix... I know there has been mention of this here, but I could not find a clear answer that a) does what I need; b) does it quickly).

Here is my function:

downsampled <- function(data,samplerate=0.8) {
    data.test <- apply(data,2,function(q) {
    names(q) <- rownames(data)
    samplepool <- character()
    for (i in names(q)) {
      samplepool <- append(samplepool,rep(i,times=q[i]))  
    }
    sampled <- sample(samplepool,size=samplerate*length(samplepool),replace = F)
    tab <- table(sampled)
    mat <- match(names(tab),names(q))
    toret=numeric(length <- length(q))
    names(toret) <- names(q)
    toret[mat] <- tab
    return(toret)
  })
return(data.test)
}

I need to be downsampling matrices with millions of entries. I find this is quite slow (here I'm using a 1000x1000 matrix, which is about 20-100x smaller than my typical data size):

mat <- matrix(sample(0:40,1000*1000,replace=T),ncol=1000,nrow=1000)
colnames(mat) <- paste0("C",1:1000)
rownames(mat) <- paste0("R",1:1000)
system.time(matd <- downsampled(mat,0.8))

##  user  system elapsed 
## 69.322  21.791  92.512 

Is there a faster/easier way to perform this operation that I haven't thought of?

Upvotes: 0

Views: 1771

Answers (2)

SlowLoris
SlowLoris

Reputation: 1005

I think you can make this dramatically faster. If I understand what you are trying to do correctly, you want to down-sample each cell of the matrix, such that if samplerate = 0.5 and the cell of the matrix is mat[i,j] = 5, then you want to sample up to 5 things where each thing has a 0.5 chance of being sampled.

To speed things up, rather than doing all these operations on columns of the matrix, you can just loop through each cell of the matrix, draw n things from that cell by using runif (e.g., if mat[i,j] = 5, you can generate 5 random numbers between 0 and 1, and then add up the number of values that are < samplerate), and finally add the number of things to a new matrix. I think this effectively achieves the same down-sampling scheme, but much more efficiently (both in terms of running time and lines of code).

# Sample matrix
set.seed(23)
n <- 1000
mat <- matrix(sample(0:10,n*n,replace=T),ncol=n,nrow=n)
colnames(mat) <- paste0("C",1:n)
rownames(mat) <- paste0("R",1:n)

# Old function
downsampled<-function(data,samplerate=0.8) {
    data.test<-apply(data,2,function(q){
    names(q)<-rownames(data)
    samplepool<-character()
    for (i in names(q)) {
      samplepool=append(samplepool,rep(i,times=q[i]))  
    }
    sampled=sample(samplepool,size=samplerate*length(samplepool),replace = F)
    tab=table(sampled)
    mat=match(names(tab),names(q))
    toret=numeric(length = length(q))
    names(toret)<-names(q)
    toret[mat]<-tab
    return(toret)
  })
return(data.test)
}

# New function
downsampled2 <- function(mat, samplerate=0.8) {
    new <- matrix(0, nrow(mat), ncol(mat))
    colnames(new) <- colnames(mat)
    rownames(new) <- rownames(mat)
    for (i in 1:nrow(mat)) {
        for (j in 1:ncol(mat)) {
            new[i,j] <- sum(runif(mat[i,j], 0, 1) < samplerate)
        }
    }
    return(new)
}

# Compare times
system.time(downsampled(mat,0.8))
##    user  system elapsed 
##  26.840   3.249  29.902 
system.time(downsampled2(mat,0.8))
##    user  system elapsed 
##   4.704   0.247   4.918 

Using an example 1000 X 1000 matrix, the new function I provided runs about 6 times faster.

Upvotes: 1

lmo
lmo

Reputation: 38520

One source of savings would be to remove the for loop that appends samplepool using rep. Here is a reproducible example:

myRows <- 1:5
names(myRows) <- letters[1:5]
# get the repeated values for sampling
samplepool <- rep(names(myRows), myRows)

Within your function, this would be

samplepool <- rep(names(q), q)  

Upvotes: 0

Related Questions