Reputation: 2429
Newbie question: this double loop over a data.frame of about 50K elements runs very slowly, taking over 30 seconds. I've read online that I should be using some form of the apply function to fix this, but so far I cannot get the code right. Starting from a first data.frame that holds gain results, the goal is to build a second data.frame holding the amount by which each value exceeds the target, with 0 everywhere else.
This code works:
ExcessGain <- function(Value, Target){
  max(0, Value - Target)
}

Pcnt_O_O_x <- data.frame()
for (j in 1:ncol(Pcnt_O_O)){
  for (i in 1:nrow(Pcnt_O_O)){
    Pcnt_O_O_x[i, j] <- ExcessGain(Pcnt_O_O[i, j], GainTargetPcnt)
  }
}
Can I speed this up somehow by using an apply function instead of the inner loop?
Upvotes: 3
Views: 336
Reputation: 59970
Your function looks like it just subtracts a target value from each cell of your array and replaces any negative result with 0. In that case you don't need any loops at all; you can use R's built-in vectorisation to do this:
set.seed(123)
# If you have a data.frame of all numeric elements turn it into a matrix first
df <- as.matrix( data.frame( matrix( runif(25) , nrow = 5 ) ) )
target <- 0.5
df
# X1 X2 X3 X4 X5
#1 0.2875775 0.0455565 0.9568333 0.89982497 0.8895393
#2 0.7883051 0.5281055 0.4533342 0.24608773 0.6928034
#3 0.4089769 0.8924190 0.6775706 0.04205953 0.6405068
#4 0.8830174 0.5514350 0.5726334 0.32792072 0.9942698
#5 0.9404673 0.4566147 0.1029247 0.95450365 0.6557058
df2 <- df - target
df2
# X1 X2 X3 X4 X5
#1 -0.21242248 -0.45444350 0.45683335 0.3998250 0.3895393
#2 0.28830514 0.02810549 -0.04666584 -0.2539123 0.1928034
#3 -0.09102308 0.39241904 0.17757064 -0.4579405 0.1405068
#4 0.38301740 0.05143501 0.07263340 -0.1720793 0.4942698
#5 0.44046728 -0.04338526 -0.39707532 0.4545036 0.1557058
df2[ df2 < 0 ] <- 0
df2
# X1 X2 X3 X4 X5
#1 0.0000000 0.00000000 0.4568333 0.3998250 0.3895393
#2 0.2883051 0.02810549 0.0000000 0.0000000 0.1928034
#3 0.0000000 0.39241904 0.1775706 0.0000000 0.1405068
#4 0.3830174 0.05143501 0.0726334 0.0000000 0.4942698
#5 0.4404673 0.00000000 0.0000000 0.4545036 0.1557058
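Equivalently, base R's pmax() computes element-wise maxima and keeps the matrix dimensions, so the subtraction and the flooring can be collapsed into one step:

# Same result as the two steps above, in one line
df2 <- pmax( df - target , 0 )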
Here is some benchmarking to show the difference in speed between operating on a matrix and operating on a data.frame. f.df( df ) and f.m( m ) are two functions operating on a data.frame and a matrix of 1 million elements respectively.
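A minimal sketch of what those two functions could look like, assuming each simply performs the subtract-and-floor step above (this setup is a reconstruction for illustration, not the original benchmark code):

n <- 1000                                 # 1000 x 1000 = 1 million elements
m  <- matrix( runif( n * n ) , nrow = n )
df <- as.data.frame( m )
target <- 0.5

f.df <- function( df ){                   # operates on a data.frame
  df <- df - target
  df[ df < 0 ] <- 0
  df
}

f.m <- function( m ){                     # operates on a matrix
  m <- m - target
  m[ m < 0 ] <- 0
  m
}

The benchmark itself: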
require( microbenchmark )
microbenchmark( f.df( df ) , f.m( m ) , times = 10L )
#Unit: milliseconds
# expr min lq median uq max neval
# f.df(df) 6944.09808 9009.39684 9233.18528 9533.75089 10036.5963 10
# f.m(m) 37.26433 39.00189 40.46229 41.15626 130.6983 10
Operating on a matrix is two orders of magnitude quicker when the matrix is big.
If you really need to use an apply function, you can apply over every cell of a matrix like so:
m <- matrix( runif(25) , nrow = 5 )
target <- 0.5
apply( m , 1:2 , function(x) max(x - target , 0 ) )
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.4575807 0.0000000 0.15935928 0.0000000 0.1948637
#[2,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
#[3,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
#[4,] 0.3912719 0.0000000 0.06155316 0.1533290 0.0000000
#[5,] 0.3228921 0.4697041 0.23554353 0.1352888 0.0000000
Upvotes: 3