Reputation: 245
I am writing a function to calculate the difference in means between two groups, where the group assignment changes on every iteration. Getting the result is simple, but my data set is rather large, so speed is key. This is the "readable" version, using the iris data as an example.
loopDif = function(Nsim) {
  change = numeric(Nsim)
  var = iris$Sepal.Length
  for (i in 1:Nsim){
    randomSpecies = sample(c("A","B"), length(var), replace=TRUE)
    change[i] = diff(tapply(var, randomSpecies, mean))
  }
  return(change)
}
> system.time(loopDif(10000))
user system elapsed
2.06 0.00 2.06
Then I tried to vectorise the code:
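slowDif <- function(Nsim) {
  # var must be defined before length(var) is used below
  var = iris$Sepal.Length
  # one column of random labels per simulation
  randomSpecies = replicate(Nsim, sample(c("A","B"), length(var), replace=TRUE))
  # means come out interleaved as A1, B1, A2, B2, ...
  means = unlist(lapply(split(randomSpecies, col(randomSpecies)),
                        function(x) unlist(lapply(split(var, x), mean))))
  # keep only the within-simulation differences B_i - A_i
  change = diff(means)[c(TRUE, FALSE)]
  return(change)
}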
> system.time(slowDif(10000))
user system elapsed
1.42 0.00 1.42
It is faster now, but still not fast enough. I hope to get it under 1 second, or even 0.75 seconds. I am focused on the timing because I have a deadline to meet and my current code is too slow.
I also tried profiling, which tells me that the unlist(lapply()) part is the bottleneck, but I have no idea how to rewrite it.
I would really appreciate it if anyone could provide an alternative, or even just suggestions. Thanks.
Upvotes: 1
Views: 92
Reputation: 24480
Try this:
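loopDif2 <- function(Nsim) {
  change <- numeric(Nsim)
  var <- iris$Sepal.Length
  # size of group A in each simulation
  nAgroup <- rbinom(Nsim, length(var), 0.5)
  tot <- sum(var)
  for (i in 1:Nsim){
    # sum of var over a random group A of size nAgroup[i]
    change[i] <- sum(var[sample(length(var), nAgroup[i])])
  }
  # mean of A minus mean of B (loopDif returns B minus A)
  change/nAgroup - (tot-change)/(length(var)-nAgroup)
}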
In words: I first draw the number of elements in group A, keeping group B implicit. Then, in each iteration, I draw the indices of group A, take the sum, and divide by the number of elements to get the mean. The sum for group B is simply the total sum of the variable minus the sum for group A, and dividing it by the size of group B gives the mean of group B.
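The identity behind the implicit group B can be checked on a single random split. This is only an illustrative sketch: the names `labels`, `inA`, `sumA`, `nA`, `reference` and `shortcut` are introduced here and do not appear in the functions above, and the seed is arbitrary.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
var <- iris$Sepal.Length
n   <- length(var)
tot <- sum(var)

# One random split: each observation is labelled A or B with probability 0.5
labels <- sample(c("A", "B"), n, replace = TRUE)
inA    <- labels == "A"

# Reference value, computed as in loopDif: mean(B) - mean(A)
reference <- as.numeric(diff(tapply(var, labels, mean)))

# Same value from the sum of group A alone; group B stays implicit
sumA <- sum(var[inA])
nA   <- sum(inA)
shortcut <- (tot - sumA) / (n - nA) - sumA / nA

# TRUE when both routes agree up to floating-point tolerance
isTRUE(all.equal(reference, shortcut))
```

Only one sum per simulation is needed, which is where the time saving should come from: the second mean follows from the fixed total and the group size.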
Performance on my PC:
system.time(loopDif(10000))
# user system elapsed
#3.855 0.004 3.867
system.time(loopDif2(10000))
# user system elapsed
#0.139 0.000 0.139
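If the remaining for loop is still a concern, the same idea could in principle be written without any loop by drawing all group indicators at once and using a matrix product. The sketch below is an assumption-laden variant, not the code timed above: `matDif` is a hypothetical name, it has not been benchmarked here, and it holds a `length(var)` by `Nsim` matrix in memory, so a large data set might need to be processed in chunks.

```r
# Hypothetical loop-free variant of loopDif2; a sketch, not benchmarked
matDif <- function(Nsim) {
  var <- iris$Sepal.Length
  n   <- length(var)
  tot <- sum(var)
  # n x Nsim matrix of 0/1 indicators: 1 means the observation is in group A
  inA  <- matrix(rbinom(n * Nsim, 1, 0.5), nrow = n)
  nA   <- colSums(inA)                     # size of group A per simulation
  sumA <- as.numeric(crossprod(var, inA))  # sum of group A per simulation
  # mean(B) - mean(A), matching the sign of diff(tapply(...)) in loopDif
  (tot - sumA) / (n - nA) - sumA / nA
}

system.time(matDif(10000))
```

Drawing each indicator independently with probability 0.5 gives the same random split as the per-observation sample() call in loopDif, so the result should follow the same distribution.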
Upvotes: 1