Christopher B. L

Reputation: 245

Speed up tapply with changing groups

I am writing a function to calculate the difference in means between two groups, where the group assignment changes on every iteration. Getting the result is simple, but my data set is rather large, so speed is key. This is the "readable" version, using the iris data as an example.

loopDif = function(Nsim) {
  change = numeric(Nsim)
  var = iris$Sepal.Length
  for (i in 1:Nsim){
    randomSpecies = sample(c("A","B"), length(var), replace=TRUE)
    change[i]  =  diff(tapply(var,  randomSpecies,  mean))
  }
  return(change)
}

> system.time(loopDif(10000))
   user  system elapsed 
   2.06    0.00    2.06 

Then I tried to vectorise the code:

slowDif <- function(Nsim) {
  # var must be defined before length(var) is used below
  var = iris$Sepal.Length
  randomSpecies = replicate(Nsim, sample(c("A","B"), length(var), replace=TRUE))
  # take diff() inside each simulation so there is one B - A difference per column
  change = unlist(lapply(split(randomSpecies, col(randomSpecies)),
                         function(x) diff(unlist(lapply(split(var, x), mean)))))
  return(change)
}

> system.time(slowDif(10000))
   user  system elapsed 
   1.42    0.00    1.42

It is faster now, but still not fast enough. I hope to get it under 1 second, or even 0.75 seconds. I am so focused on the time because I have a deadline to meet and my current code is too slow.

I also tried profiling, which tells me that the unlist(lapply()) part is the bottleneck, but I have no idea how to rewrite it.

I would really appreciate it if anyone could provide an alternative, or even just suggestions. Thanks.

Upvotes: 1

Views: 92

Answers (1)

nicola

Reputation: 24480

Try this:

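loopDif2 <- function(Nsim) {
    change <- numeric(Nsim)
    var <- iris$Sepal.Length
    # size of group A in each simulation; group B is the remainder
    nAgroup <- rbinom(Nsim, length(var), 0.5)
    tot <- sum(var)
    for (i in seq_len(Nsim)) {
      # sum of var over a randomly chosen A group of the drawn size
      change[i] <- sum(var[sample(length(var), nAgroup[i])])
    }
    # mean of A minus mean of B (opposite sign to diff(tapply(...)) in the question)
    change / nAgroup - (tot - change) / (length(var) - nAgroup)
}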

In words: I first draw the number of elements in group A for each iteration, keeping group B implicit. Then, in each iteration, I sample the indices of the A group and sum the corresponding values; dividing by the group size gives the mean of A. The sum for group B is simply the total sum of the variable minus the A-group sum, and dividing that by the B-group size gives the mean of B.
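The identity behind this shortcut can be checked on a single random labelling. The following is only a sketch: the names labels, ref and shortcut are made up for the illustration, and the seed is arbitrary.

```r
x <- iris$Sepal.Length
set.seed(42)  # arbitrary seed, only to make the illustration repeatable
labels <- sample(c("A", "B"), length(x), replace = TRUE)

# Reference: difference of group means as computed in the question (B minus A)
ref <- as.numeric(diff(tapply(x, labels, mean)))

# Shortcut: only the size and the sum of group A are needed
nA    <- sum(labels == "A")
sumA  <- sum(x[labels == "A"])
meanA <- sumA / nA
meanB <- (sum(x) - sumA) / (length(x) - nA)
shortcut <- meanB - meanA

# TRUE if both routes agree up to floating-point tolerance
all.equal(ref, shortcut)
```

Note that loopDif2 returns mean(A) minus mean(B), while diff(tapply(...)) returns mean(B) minus mean(A). Because the labels are assigned at random with probability 0.5, the two groups are exchangeable and the distribution of the difference is symmetric, so the sign convention should not matter for a simulation of this kind.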

Performance on my PC:

system.time(loopDif(10000))
# user  system elapsed 
#3.855   0.004   3.867 
system.time(loopDif2(10000))
# user  system elapsed 
#0.139   0.000   0.139 
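If the remaining for loop is still a concern, the same sum-based idea could probably be written without any explicit loop by drawing all group memberships at once as a 0/1 matrix. This is not part of the answer above and has not been benchmarked here; vecDif is a hypothetical name, and the approach trades memory (a matrix with length(x) * Nsim entries) for the loop.

```r
# Sketch only: vecDif is a hypothetical helper, not from the original answer.
vecDif <- function(Nsim, x = iris$Sepal.Length) {
  n <- length(x)
  # n x Nsim matrix of 0/1 draws: 1 means the observation is in group A
  inA <- matrix(rbinom(n * Nsim, 1, 0.5), nrow = n)
  nA   <- colSums(inA)                   # size of group A per simulation
  sumA <- as.vector(crossprod(x, inA))   # sum of x over group A per simulation
  tot  <- sum(x)
  # mean(B) - mean(A), the same sign convention as diff(tapply(...))
  (tot - sumA) / (n - nA) - sumA / nA
}

set.seed(1)
d <- vecDif(1000)
length(d)
```

For very large data sets the membership matrix may not fit in memory, in which case processing the simulations in chunks (or keeping the loop from loopDif2) would be the safer choice.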

Upvotes: 1
