Reputation: 617

R: apply vs do.call

I just read the profile of @David Arenburg, and found a bunch of useful tips for how to develop good R-programming skills/habits, and one especially struck me. I have always thought that the apply functions in R was the cornerstone of working with dataframes, but he writes:

If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).

Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans/etc, the awesome matrixStats packages (for matrices), ?rowsum and many more

Could anybody explain this to me? Why are apply functions frowned upon?

Upvotes: 7

Answers (3)

G. Grothendieck

Reputation: 270268

apply(DF, 1, f) converts each row of DF to a vector and then passes that vector to f. If DF were a mix of strings and numbers then the row would be converted to a character vector before passing it to f so that, for example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains all numeric elements. The row is converted to character string and you can't sum character strings. On the other hand apply(iris[-5], 1, sum) will work the same as rowSums(iris[-5]).
if f produces a vector the result is a matrix and not another data frame; also, the result is the transpose of what you might expect. This
```
apply(BOD, 1, identity)
```
gives the following rather than giving BOD back:
```
       [,1] [,2] [,3] [,4] [,5] [,6]
Time    1.0  2.0    3    4  5.0  7.0
demand  8.3 10.3   19   16 15.6 19.8
```
Many years ago Hadley Wickham did post iapply which is idempotent in the sense that iapply(mat, 1, identity) returns mat, rather than t(mat), where mat is a matrix. More recently with his plyr package one can write:
```
library(plyr)
ddplyr(BOD, 1, identity)
```
and get BOD back as a data frame.

On the other hand apply(BOD, 1, sum) will give the same result as rowSums(BOD) and apply(BOD, 1, f) might be useful for functions f for which f produces a scalar and there is no counterpart such as in the sum / rowSums case. Also if f produces a vector and you don't mind a matrix result you can transpose the output of apply yourself and although ugly it would work.

Upvotes: 5

Novice

Reputation: 115

It is related to how R stores matrices and data frames*. As you may know, a data.frame is a list of vectors, that is, each column in the data.frame is a vector. Being a vectorized language, it is preferable to operate on vectors and that is the reason apply with margin of 2 is frowned upon: by doing so you will not be working on vectors, rather, you will be spanning across different vectors on each iteration.

As far as I know, using apply with margin 1 is not much different than using do.call. Although the latter might allow some more usage flexibility.

*This information should be somewhere in the manuals.

Upvotes: 1

r.user.05apr

Reputation: 5456

I think what the author means, is that you should use pre-built/vectorized functions (because it is easier), if you can and avoid apply (because in principle it is a for loop and takes longer):

library(microbenchmark)

d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))

# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!

# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))

# good: do.call faster than lapply
microbenchmark(do.call("round", list(d, digits = 1)))

# --------------
# Unit: microseconds
#                                  expr     min    lq     mean  median      uq     max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
#
#                                  expr    min      lq    mean median       uq     max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100

Upvotes: 2

R: apply vs do.call

Answers (3)

Related Questions