Mark Miller

Reputation: 13103

microbenchmark with data.table, tapply, aggregate, ave and dplyr

I was attempting to compare the speed of several approaches for obtaining summary statistics by group. However, I get an error when running microbenchmark. The error states:

Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'TRIAL_INDEX' is a character column being joined to i.'TRIAL_INDEX' which is type 'integer'. Character columns must join to factor or character columns.

I am not sure, but I think data.table changes an attribute of the variable TRIAL_INDEX. From searching Stack Overflow for similar questions, I guess there have been conflicts between some packages.

Is there a work-around, so I can perhaps change the attribute of TRIAL_INDEX back to integer or take other action so the microbenchmark function will work? Or maybe I am making an error I am not seeing.

Here is the code with the five functions I am attempting to compare. From running subsets of these functions I am impressed by how fast the ave function is.

library(microbenchmark)
library(dplyr)
library(data.table)

poo <- read.table(text = '
     TRIAL_INDEX     RIGHT_PUPIL_SIZE
          1                 10
          1                  8
          1                  6
          1                  4
          1                 NA
          2                  1
          2                  2
          2                 NA
          2                  4
          2                  5
', header = TRUE, stringsAsFactors = FALSE, na.strings = "NA")

tapply.function <- function(x) {

     my.summary <- as.data.frame(do.call("rbind", 
                   tapply(poo$RIGHT_PUPIL_SIZE, poo$TRIAL_INDEX, 
                   function(x) c(index.mean = mean(x, na.rm = TRUE),
                                   index.sd =   sd(x, na.rm = TRUE)))))

     my.summary$TRIAL_INDEX <- rownames(my.summary)

     poo2 <- merge(poo, my.summary, by = 'TRIAL_INDEX')

     return(poo2)

}

str(tapply.function(poo))

aggregate.function <- function(x) {

     my.summary <- with(poo, aggregate(RIGHT_PUPIL_SIZE, by = list(TRIAL_INDEX), 
                        FUN = function(x) {c( index.mean = mean(x, na.rm = TRUE), 
                                              index.sd   =   sd(x, na.rm = TRUE))}))

     my.summary <- do.call(data.frame, my.summary)

     colnames(my.summary) <- c('TRIAL_INDEX', 'index.mean', 'index.sd')

     poo2 <- merge(poo, my.summary, by = 'TRIAL_INDEX')

     return(poo2)

}

str(aggregate.function(poo))

ave.function <- function(x) {

     index.mean <- ave(poo$RIGHT_PUPIL_SIZE, poo$TRIAL_INDEX, FUN = function(x) mean(x, na.rm = TRUE))
     index.sd   <- ave(poo$RIGHT_PUPIL_SIZE, poo$TRIAL_INDEX, FUN = function(x)   sd(x, na.rm = TRUE))

     poo2 <- data.frame(poo, index.mean, index.sd)

     return(poo2)

}

str(ave.function(poo))

dplyr.function <- function(x) {

     my.summary <- poo %>%
         group_by(TRIAL_INDEX) %>% 
         summarise(index.mean = mean(RIGHT_PUPIL_SIZE, na.rm = TRUE),
                     index.sd =   sd(RIGHT_PUPIL_SIZE, na.rm = TRUE))

     poo2 <- merge(poo, as.data.frame(my.summary), by = 'TRIAL_INDEX')

     return(poo2)

}

str(dplyr.function(poo))

data.table.function <- function(x) {

     my.summary <- data.frame(setDT(poo)[, .(index.mean = mean(RIGHT_PUPIL_SIZE, na.rm = TRUE), 
                                               index.sd =   sd(RIGHT_PUPIL_SIZE, na.rm = TRUE)),
                          .(TRIAL_INDEX)])

     poo2 <- merge(poo, my.summary, by = 'TRIAL_INDEX')

     return(poo2)

}

str(data.table.function(poo))

# this does not work
microbenchmark(    tapply.function(poo),
                aggregate.function(poo),
                      ave.function(poo),
                    dplyr.function(poo), 
               data.table.function(poo), times = 1000)

Upvotes: 1

Views: 420

Answers (1)

Arun

Reputation: 118789

A simple way to debug this is to add a cat("In tapply"), cat("In ave"), etc. call to each of your functions and run the benchmark again with times = 1L.

Doing that, I get this:

> microbenchmark(    tapply.function(poo),
+                    aggregate.function(poo),
+                    ave.function(poo),
+                    dplyr.function(poo), 
+                    data.table.function(poo), times = 1)
In dplyr
In tapply
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'TRIAL_INDEX' is a character column being joined to i.'TRIAL_INDEX' which is type 'integer'. Character columns must join to factor or character columns.

The error occurs in tapply.function().

Let's have a look at the first two lines in that function:

my.summary <- as.data.frame(do.call("rbind", tapply(poo$RIGHT_PUPIL_SIZE, 
                  poo$TRIAL_INDEX, function(x) c(index.mean = mean(x, na.rm = 
                        TRUE), index.sd =   sd(x, na.rm = TRUE)))))
my.summary$TRIAL_INDEX <- rownames(my.summary)

ding ding ding.. we've a winner...

str(my.summary)
# 'data.frame': 2 obs. of  3 variables:
#  $ index.mean : num  7 3
#  $ index.sd   : num  2.58 1.83
#  $ TRIAL_INDEX: chr  "1" "2" ## <~~~ char type

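And that's the reason for the error message on the next merge. Why? Because you're using setDT(poo) (weird name for an object btw) in data.table.function(), which modifies poo (?!?) by reference. You already ran str(data.table.function(poo)) before the benchmark, so from that point on poo is a data.table and all subsequent tests use it as one. merge() then dispatches to data.table's merge method, which refuses to join the integer TRIAL_INDEX in poo to the character TRIAL_INDEX in my.summary, whereas base merge() on a data.frame accepts the mismatch.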
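To make the by-reference behaviour concrete, here is a minimal sketch. The objects df and lookup and the helper convert_inside() are hypothetical names made up for illustration; they are not part of the question's code.

```r
library(data.table)

# Hypothetical toy data: an integer key in df, a character key in lookup
df <- data.frame(id = c(1L, 1L, 2L), value = c(10, 8, 1))
lookup <- data.frame(id = c("1", "2"), label = c("a", "b"),
                     stringsAsFactors = FALSE)

# While df is a plain data.frame, base merge() accepts the mismatched key types
base_result <- merge(df, lookup, by = "id")

convert_inside <- function() {
  # setDT() converts the global df in place; no copy is made or returned
  setDT(df)
  invisible(NULL)
}
convert_inside()
class(df)  # now also carries the "data.table" class

# merge() now dispatches to data.table's method, which checks key types
dt_result <- tryCatch(merge(df, lookup, by = "id"), error = function(e) e)
inherits(dt_result, "error")
```

The same thing happens in the question: the function never returns the converted object, yet the global one changes class, and every later merge() on it behaves differently.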

To fix it, either call setDF(poo) at the end of data.table.function(), just before returning the result, or use as.data.table(poo) inside the function instead of setDT(poo). In the latter case, benchmark as.data.table(poo) separately so that the conversion time can be deducted from the timing of the data.table function.
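A possible version of the second option is sketched below. It assumes the function should work on its argument x rather than on the global poo, which is a change from the question's code, and it rebuilds poo with data.frame() only to keep the example self-contained.

```r
library(data.table)

poo <- data.frame(
  TRIAL_INDEX      = rep(1:2, each = 5),
  RIGHT_PUPIL_SIZE = c(10, 8, 6, 4, NA, 1, 2, NA, 4, 5)
)

data.table.function <- function(x) {
  # as.data.table() returns a copy, so the caller's data.frame is left untouched
  dt <- as.data.table(x)

  my.summary <- dt[, .(index.mean = mean(RIGHT_PUPIL_SIZE, na.rm = TRUE),
                       index.sd   =   sd(RIGHT_PUPIL_SIZE, na.rm = TRUE)),
                   by = TRIAL_INDEX]

  # x is still a data.frame here, so this uses base merge()
  merge(x, as.data.frame(my.summary), by = "TRIAL_INDEX")
}

res <- data.table.function(poo)
class(poo)  # poo keeps its original class
str(res)
```

Because the copy made by as.data.table() is now part of what gets timed, timing that call on its own gives the amount to subtract.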

All of this aside, on 10 rows, you're very likely to just measure the overhead in type conversion from data.frame -> data.table -> data.frame. I'm not sure what meaningful conclusions you can get from ns/us timings (unless you're repeating this task >1,000 or 10,000 times).

Upvotes: 2
