Reputation: 6845
I have written some code to compute the correlation coefficient in R. However, I just found out that the 'boot' package offers a corr() function which does the same job. Are built-in functions in R usually more efficient and faster than the equivalent ones we write from scratch?
Thank you.
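For concreteness, a minimal hand-rolled Pearson correlation (purely illustrative, not necessarily my exact code) might look like this:
# Minimal from-scratch Pearson correlation (illustrative sketch only)
my_cor <- function(x, y) {
  xm <- x - mean(x)
  ym <- y - mean(y)
  sum(xm * ym) / sqrt(sum(xm^2) * sum(ym^2))
}
x <- runif(100); y <- runif(100)
my_cor(x, y)   # should agree with cor(x, y) up to floating-point error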
Upvotes: 5
Views: 573
Reputation: 226532
One more point is that built-in R functions often have a lot of "wrapper" material that does error checking, rearranges data, etc. For example, lm and glm each do a lot of work before handing over to lm.fit and glm.fit respectively for the actual number crunching. In your particular case, cor calls .Internal(cor(x, y, na.method, FALSE)) for Pearson correlations. If (1) you really need speed and (2) you're willing to arrange the data appropriately yourself and forgo the error checking, you can sometimes save some time by calling the internal function yourself:
library(rbenchmark)
x <- y <- runif(1000)
benchmark(cor(x,y),.Internal(cor(x,y,4,FALSE)),replications=10000)
                            test replications elapsed relative user.self
1                      cor(x, y)        10000   1.131 5.004425     1.136
2 .Internal(cor(x, y, 4, FALSE))        10000   0.226 1.000000     0.224
But again, this depends: we don't gain much at all when the vectors are large, as in the example below, because the time spent on error checking is then small relative to the computation itself ...
x <- y <- rnorm(5e5)
benchmark(cor(x,y),.Internal(cor(x,y,4,FALSE)),replications=500)
                            test replications elapsed relative user.self
1                      cor(x, y)          500   5.402 1.013889     5.384
2 .Internal(cor(x, y, 4, FALSE))          500   5.328 1.000000     5.316
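The same pattern holds for the lm / lm.fit case mentioned at the top; a toy sketch along these lines (arbitrary made-up design matrix, rbenchmark already loaded above):
# Sketch: formula-based lm vs. the bare number-crunching lm.fit (toy data)
X <- cbind(1, runif(1000))                 # design matrix with an intercept column
y2 <- drop(X %*% c(2, 3)) + rnorm(1000)
benchmark(lm(y2 ~ X[, 2]),                 # formula parsing, model frame, checks, ...
          lm.fit(X, y2),                   # just the QR decomposition and coefficients
          replications = 1000)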
Upvotes: 2
Reputation: 18628
This more or less (i.e. not counting badly written code) boils down to the question of whether a given procedure is implemented in R or as C(++) or Fortran code -- if the function contains a call to .Internal, .External, .C, .Fortran or .Call, it is the second case and it will probably run faster. Note that this is orthogonal to the question of whether the function comes from base R or from a package.
However, you must always remember that efficiency is a relative thing and must always be judged in the context of the whole task, weighed against the programmer's effort needed to speed something up. It is equally nonsensical to reduce execution time from 1 s to 10 ms when nobody will notice, to rewrite everything in base R just because packages are supposedly evil, or to invest a few hours in optimizing function A while 90% of the actual execution time hides in function B.
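To find out where that 90% actually hides, a quick profiling run is usually the first step; a minimal sketch:
# Minimal profiling sketch: see which functions the time is really spent in
Rprof("prof.out")
for (i in 1:200) cor(matrix(rnorm(1e5), ncol = 10))   # arbitrary toy workload
Rprof(NULL)
summaryRprof("prof.out")$by.self    # time per function, sorted by self time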
Upvotes: 3
Reputation: 14450
Extending Chase's answer, I not only think that there is no single answer to this question, but also that the question itself is not a good one: it is very unspecific. Please see here for guidance on which questions to ask.
Furthermore, I have the feeling the OP is not aware of the cor function in base R; see ?cor.
My answer: there are specialized functions that are extremely fast, e.g. rowSums compared to apply with sum. On the other hand, there are built-in slownesses that could be avoided (if you are willing to invest some time to get down to the basics) but that exist because of design decisions. Radford Neal argues from this corner; see e.g. one of his latest posts on the topic.
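For example, a quick illustrative comparison (exact numbers will of course vary by machine):
# Specialized rowSums vs. the generic apply/sum combination
m <- matrix(rnorm(1e6), ncol = 100)
system.time(for (i in 1:50) rowSums(m))
system.time(for (i in 1:50) apply(m, 1, sum))   # typically much slower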
In sum, I guess the answer to this question boils down to what I think is the philosophy behind R: R is not the fastest horse in the race, but it is definitely the one that achieves the most with the least code, if the task is about data.
In general, I think it is not too far wrong to say that the more specialized a function is, the more likely it is to be very fast (and written in C or Fortran), whereas the more general and abstract a function is, the slower it generally is (compare the speed of Hadley Wickham's plyr with that of the base apply family).
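A rough sketch of such a comparison, with made-up grouped data (plyr trades some speed for a much more uniform interface):
# General ddply vs. the more specialized base tapply (toy grouped data)
library(plyr)
d <- data.frame(g = sample(letters, 1e5, replace = TRUE), x = rnorm(1e5))
system.time(ddply(d, "g", summarise, m = mean(x)))
system.time(tapply(d$x, d$g, mean))   # usually noticeably faster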
Upvotes: 2
Reputation: 69201
I don't think there is a single specific answer to this question, as it will vary wildly depending on the specific function you are asking about. Some functions in contributed packages are added as a convenience and are simply wrappers around base functions. Others are added to extend the base functionality or to address some other perceived deficit in the base functions. Some, as you suggest, are added to improve computation time or to become more efficient. And others are added because the authors of the contributing packages feel that the solutions in base R are simply wrong in some way.
In the case of stats:::cor and boot:::corr, it looks like the latter adds a weighting capability. It does not necessarily appear to be any faster:
> dat <- matrix(rnorm(1e6), ncol = 2)
> system.time(
+ cor(dat[, 1],dat[, 2])
+ )
user system elapsed
0.01 0.00 0.02
> system.time(
+ corr(dat)
+ )
user system elapsed
0.11 0.00 0.11
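The weighting is the real difference; something along these lines (arbitrary weights, purely illustrative -- with its default equal weights, corr() reproduces the ordinary Pearson value):
# boot::corr's addition over cor() is the w argument for weighted correlation
library(boot)
w <- runif(nrow(dat))
corr(dat, w = w / sum(w))   # weighted Pearson correlation of the two columns
corr(dat)                   # default equal weights: matches cor(dat[, 1], dat[, 2])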
Upvotes: 5