Why is dplyr slower than plyr for data aggregation?

Question

Background question:

Suppose we have a data set like:

ID DRIVE_NUM FLAG
 1         A PASS
 2         A FAIL
 3         A PASS
-----------------
 4         B PASS
 5         B PASS
 6         B PASS
-----------------
 7         C PASS
 8         C FAIL
 9         C FAIL

I want to aggregate this data set by DRIVE_NUM by the following rule:

For a specific DRIVE_NUM group,

If there is any FAIL flag in the DRIVE_NUM group, I want the first row with the FAIL flag.

If there is no FAIL flag in the group, just take the first row in the group.

So, I shall get the following set:

  ID DRIVE_NUM FLAG
   2         A FAIL
   4         B PASS
   8         C FAIL

Update:

It seems that dplyr solution is even slower than plyr. Am I using anything inappropriately?

#Simulate Data

X = data.frame(
  group = rep(paste0("NO",1:10000),each=2),
  flag = sample(c("F","P"),20000,replace = TRUE),
  var = rnorm(20000)
)



library(plyr)
library(dplyr)

#plyr

START = proc.time()
X2 = ddply(X,.(flag),function(df) {
  if( sum(df$flag=="F")> 0){
    R = df[df$flag=="F",]
    if(nrow(R)>1) {R = R[1,]} else {R = R}
  } else{
    R = df[1,]
  }
  R
})
proc.time() - START   

#user  system elapsed 
#0.03    0.00    0.03 

#dplyr method 1

START = proc.time()
X %>%
  group_by(group) %>% 
  slice(which.min(flag))
proc.time() - START  

#user  system elapsed 
#0.22    0.02    0.23 

#dplyr method 2

START = proc.time()
X %>%
  group_by(group, flag) %>%
  slice(1) %>%
  group_by(group) %>% 
  slice(which.min(flag))
proc.time() - START  

#user  system elapsed 
#0.28    0.00    0.28

Is there a data.table version that can do it much faster than plyr?

akrun · Accepted Answer

Using data.table

library(data.table)
START = proc.time()
 X3 = as.data.table(X)[X[, .I[which.min(flag)] , by = group]$V1]
proc.time() - START
#   user  system elapsed 
#  0.00    0.02    0.02

Or use order

START = proc.time()
 X4 = as.data.table(X)[order(flag), .SD[1L] , by = group]
proc.time() - START
#    user  system elapsed 
#    0.02    0.00    0.01

The corresponding timings with the dplyr and plyr using OP's code are

#   user  system elapsed 
#  0.28    0.04    2.68 

#   user  system elapsed 
#  0.01    0.06    0.67

Also as commented by @Frank, a base R method timing is

START = proc.time()
Z = X[order(X$flag),]
X5 = with(Z, Z[tapply(seq(nrow(X)), group, head, 1), ])
proc.time() - START
#    user  system elapsed 
#    0.15    0.03    0.65

I am guessing the slice is slowering the dplyr.

Why is dplyr slower than plyr for data aggregation?

Answers (2)

Related Questions