Reputation: 607
I just read the profile of @David Arenburg and found a bunch of useful tips on how to develop good R programming skills and habits. One in particular struck me. I have always thought that the apply functions in R were the cornerstone of working with data frames, but he writes:
If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).
Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans/etc, the awesome matrixStats packages (for matrices), ?rowsum and many more
Could anybody explain this to me? Why are apply functions frowned upon?
Upvotes: 7
Views: 2554
Reputation: 269471
apply(DF, 1, f) converts each row of DF to a vector and then passes that vector to f. If DF were a mix of strings and numbers, then each row would be converted to a character vector before being passed to f, so that, for example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains all numeric elements. The row is converted to a character vector, and you can't sum character strings. On the other hand, apply(iris[-5], 1, sum) will work the same as rowSums(iris[-5]).
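A minimal sketch of the coercion described above, using only base R and the built-in iris data. The variable names (m, err, ok, same) are illustrative and not part of the original answer.

```r
# apply() works on a matrix, so a data frame is first passed through as.matrix().
# Because iris has a non-numeric column (Species), every cell becomes character.
m <- as.matrix(iris)
typeof(m)

# Summing the "numeric" part of each row therefore fails: x is a character vector.
err <- tryCatch(
  apply(iris, 1, function(x) sum(x[-5])),
  error = function(e) conditionMessage(e)
)

# Dropping the non-numeric column before calling apply() avoids the coercion,
# and the result then agrees with rowSums().
ok <- apply(iris[-5], 1, sum)
same <- all.equal(unname(ok), unname(rowSums(iris[-5])))
```

Here err holds the error message rather than a result, and same should be TRUE.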
If f produces a vector, the result is a matrix and not another data frame; also, the result is the transpose of what you might expect. This

apply(BOD, 1, identity)

gives the following rather than giving BOD back:

       [,1] [,2] [,3] [,4] [,5] [,6]
Time    1.0  2.0    3    4  5.0  7.0
demand  8.3 10.3   19   16 15.6 19.8
Many years ago Hadley Wickham posted iapply, which is idempotent in the sense that iapply(mat, 1, identity) returns mat, rather than t(mat), where mat is a matrix. More recently, with his plyr package, one can write:

library(plyr)
ddply(BOD, 1, identity)

and get BOD back as a data frame.
On the other hand, apply(BOD, 1, sum) will give the same result as rowSums(BOD), and apply(BOD, 1, f) might be useful for functions f that produce a scalar and have no row-wise counterpart of the kind that rowSums provides for sum. Also, if f produces a vector and you don't mind a matrix result, you can transpose the output of apply yourself; although ugly, it would work.
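The transposition workaround mentioned above can be sketched as follows, using the built-in BOD data frame. The names res and back are illustrative only.

```r
# apply() over rows with a vector-valued function returns one COLUMN per input row,
# so the result is the transpose of the original layout.
res <- apply(BOD, 1, identity)
dim(res)   # rows and columns are swapped relative to BOD

# Transposing by hand and converting back recovers the original values,
# although the result has been through a matrix on the way.
back <- as.data.frame(t(res))
```

Note that this round trip only preserves the data because both BOD columns are numeric; with mixed column types everything would come back as character.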
Upvotes: 5
Reputation: 115
It is related to how R stores matrices and data frames*. As you may know, a data.frame is a list of vectors; that is, each column in the data.frame is a vector. R being a vectorized language, it is preferable to operate on vectors, and that is the reason apply with a margin of 1 is frowned upon: by doing so you will not be working on vectors; rather, you will be spanning across different vectors on each iteration.
As far as I know, using apply with margin 2 is not much different from using do.call, although the latter might allow some more usage flexibility.
*This information should be somewhere in the manuals.
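A small sketch of the storage point made above, using a toy data frame d that is not from the original answer. It contrasts working column by column (each column is already a vector) with a row-wise apply, and shows do.call passing the columns to a vectorized function.

```r
# A toy data frame; the column names a and b are made up for illustration.
d <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# A data frame is a list with one vector per column.
is.list(d)
length(d)

# Column-wise work operates directly on those vectors.
col_means <- vapply(d, mean, numeric(1))

# Row-wise apply() has to cut across the columns on every iteration...
row_max_apply <- apply(d, 1, max)

# ...whereas do.call() hands the columns to a vectorized function in one call.
row_max_vec <- do.call(pmax, d)
```

Both row_max_apply and row_max_vec hold the row-wise maxima; the second form never builds a row.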
Upvotes: 1
Reputation: 5456
I think what the author means is that you should use pre-built, vectorized functions where you can (because it is easier) and avoid apply (because it is in principle a for loop and takes longer):
library(microbenchmark)
d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))
# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))
# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!
# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))
# good: do.call faster than lapply
microbenchmark(do.call("round", list(d, digits = 1)))
# --------------
# Unit: microseconds
# expr min lq mean median uq max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009 100
# expr min lq mean median uq max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265 100
#
# expr min lq mean median uq max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954 100
# expr min lq mean median uq max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265 100
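The benchmark above compares speed only. A quick check that each pair of expressions produces the same values can be sketched like this, in base R without microbenchmark. The set.seed call and the variable names are additions for reproducibility, not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed so the check is reproducible
d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# Row-wise minimum: loop-style apply() versus vectorized pmin().
looped     <- apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2])
vectorized <- pmin(d[[1]], d[[2]])
same_min   <- all.equal(unname(looped), vectorized)

# Rounding every column: lapply() versus a single do.call() on the data frame.
r_lapply  <- data.frame(lapply(d, round, 1))
r_docall  <- do.call("round", list(d, digits = 1))
same_round <- all.equal(as.matrix(r_lapply), as.matrix(r_docall))
```

Both same_min and same_round should be TRUE, so the faster variants are drop-in replacements for this data.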
Upvotes: 2