Alan Chalk

Reputation: 320

Efficient way to group data.table results by year

I am looking for advice as to whether I am using data.table efficiently.

I have a data set which describes incidents, with one row per incident. On each row I have the date of the incident. Right now I just want to count how many incidents there are per year. I have done this using the code below, but it feels inefficient. I would appreciate any advice on how to improve this. (The data set is far bigger than shown below and I also have to do other similar, but more complex counts)

# Create a list of dates from 2000 until the end of 2012:

dates <- seq(as.Date("1/1/2000", format="%d/%m/%Y"), 
  as.Date("31/12/2012", format="%d/%m/%Y"), 
  "day")

# Choose one million occurrences on various dates:    

sampleDate <- sample(dates, 1000000, replace=TRUE)

# Create `data.table`, one line per incident:   

library(data.table)
library(plyr)  # for the count() function used below
DT.dt <- data.table(Date=sampleDate, incident=1)

# Time how long it takes to count the number of incidents in each year: 

system.time(result <- DT.dt[,count(format(Date,"%Y"))])

user  system elapsed 
11.83    0.10   11.95 

result[1:3,]
x    freq
2000 76930
2001 77101
2002 76666

So it works (I think) but I suspect there is a more efficient solution...

Upvotes: 2

Views: 3075

Answers (1)

nrussell

Reputation: 18602

When you are doing aggregate operations (grouping) on a data.table, especially for large data sets, you should set the field you are grouping by as a key (using setkeyv(DT, "your_key_field"), etc.). Also, I can't speak definitively on the topic, but generally you will get better performance from native data.table functions and operations within your data.table object than from other packages' functions, such as plyr::count.

Below, I made a few data.table objects: the first is identical to your example; the second adds a column Year (instead of calculating format(Date,"%Y") at the time of function execution) and sets Date as the key; and the third is the same as the second, except that it uses Year as the key. I also made a few functions (for benchmarking convenience) that do the grouping in different ways.

library(data.table)
library(plyr) # for 'count' function
library(microbenchmark)
##
dates <- seq.Date(
  from=as.Date("2000-01-01"),
  to=as.Date("2012-12-31"),
  by="day")
##
set.seed(123)
sampleDate <- sample(
  dates,
  1e06,
  replace=TRUE)
##
DT.dt <- data.table(
  Date=sampleDate,
  incident=1)
##
DT.dt2 <- copy(DT.dt)
DT.dt2[,Year:=format(Date,"%Y")]
setkeyv(DT.dt2,"Date")
##
DT.dt3 <- copy(DT.dt2)
setkeyv(DT.dt3,"Year")
##
> head(DT.dt,3)
         Date incident
1: 2003-09-27        1
2: 2010-04-01        1
3: 2005-04-26        1
> head(DT.dt2,3)
         Date incident Year
1: 2000-01-01        1 2000
2: 2000-01-01        1 2000
3: 2000-01-01        1 2000
> head(DT.dt3,3)
         Date incident Year
1: 2000-01-01        1 2000
2: 2000-01-01        1 2000
3: 2000-01-01        1 2000
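## As a side note (not part of the benchmark), you can confirm
## which key, if any, is set on each object with key():
key(DT.dt)   # NULL (no key set)
key(DT.dt2)  # "Date"
key(DT.dt3)  # "Year"
##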

## your original method
f1 <- function(dt)
{
  dt[,count(format(Date,"%Y"))]
}
## your method - using 'Year' column
f1.2 <- function(dt)
{
  dt[,count(Year)]
}
## use 'Date' column; '.N' and 
## 'by=' instead of 'count'
f2 <- function(dt)
{
  dt[,.N,by=format(Date,"%Y")]
}
## use 'Year' and '.N','by='
f3 <- function(dt)
{
  dt[,.N,by=Year]
}
##
Res <- microbenchmark(
  f1(DT.dt),
  f1.2(DT.dt2),
  f1.2(DT.dt3),
  f2(DT.dt2),
  f3(DT.dt3))
##
> Res
Unit: milliseconds
         expr        min         lq     median         uq      max neval
    f1(DT.dt) 478.941767 515.144253 557.428159 585.579862 706.8724   100
 f1.2(DT.dt2)  98.722062 115.588034 126.332104 137.792116 223.4967   100
 f1.2(DT.dt3)  97.475673 118.134788 125.836817 136.136156 238.2697   100
   f2(DT.dt2) 352.767219 373.337958 387.759996 429.301164 542.1674   100
   f3(DT.dt3)   7.912803   8.441159   8.736887   9.685267  76.9629   100

Observations:

  1. Grouping by the precalculated field Year instead of calculating format(Date,"%Y") at execution time was a definite improvement, for both the count and .N approaches. You can see this by comparing the f1() and f2() times to the f1.2() times.

  2. The count approach seemed to be a little slower than the .N & 'by=' approach (compare f1() to f2()).

  3. The best approach by far was to use the precalculated field Year and the native data.table grouping .N & by=; f3() was considerably faster than the other four timings.

There are some pretty experienced data.table users on SO, certainly more so than myself, so there may be an even faster way to do this. All else aside, though, it's definitely a good idea to set a key on your data.table; and it certainly seems like you would be much better off precalculating a field like Year than doing so "on the fly". You can always delete it afterwards if you don't need it, using DT.dt[,Year:=NULL].
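As an aside, depending on your data.table version, the year() helper that ships with data.table may be a tidier way to precalculate that column than format(). Note that year() returns an integer rather than a character string, so the group labels will differ in type; whether it is any faster here is an assumption on my part that I haven't benchmarked:

DT.dt2[,Year:=year(Date)]  # integer years: 2000L, 2001L, ...
## ...and to drop the column when you no longer need it:
DT.dt2[,Year:=NULL]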

Also, you said you are trying to count the number of incidents per year - and since your example data had incident = 1 for all rows, counting was the same as summing. But assuming your real data has different values of incident, you could do something like this:

> DT.dt3[,list(Incidents=sum(incident)),by=Year]
    Year Incidents
 1: 2000     77214
 2: 2001     77385
 3: 2002     77080
 4: 2003     76609
 5: 2004     77197
 6: 2005     76994
 7: 2006     76560
 8: 2007     76904
 9: 2008     76786
10: 2009     76765
11: 2010     76675
12: 2011     76868
13: 2012     76963

(where I set Year as the key with setkeyv(DT.dt3,"Year") above).
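And since you mentioned having other similar, but more complex counts to do, note that you can compute several summaries per group in a single pass - for example, the row count and the incident total together. This is just a sketch; the output column names are my own:

DT.dt3[,list(N=.N, Incidents=sum(incident)),by=Year]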

Upvotes: 4
