Reputation: 320
I am looking for advice as to whether I am using data.table efficiently.
I have a data set which describes incidents, with one row per incident. On each row I have the date of the incident. Right now I just want to count how many incidents there are per year. I have done this using the code below, but it feels inefficient, and I would appreciate any advice on how to improve it. (The data set is far bigger than shown below, and I also have to do other similar, but more complex, counts.)
# Create a list of dates from 2000 until the end of 2012:
dates <- seq(as.Date("1/1/2000", format="%d/%m/%Y"),
             as.Date("31/12/2012", format="%d/%m/%Y"),
             "day")
# Choose one million occurrences on various dates:
sampleDate <- sample(dates, 1000000, replace=TRUE)
# Create `data.table`, one line per incident:
library(data.table)
DT.dt <- data.table(Date=sampleDate, incident=1)
# Time how long it takes to count the number of incidents in each year
# (count() comes from the plyr package):
library(plyr)
system.time(result <- DT.dt[,count(format(Date,"%Y"))])
   user  system elapsed 
  11.83    0.10   11.95 
result[1:3,]
      x  freq
1: 2000 76930
2: 2001 77101
3: 2002 76666
So it works (I think), but I suspect there is a more efficient solution...
Upvotes: 2
Views: 3075
Reputation: 18602
When you are doing aggregate operations (grouping) with data.tables, especially for large data sets, you should set the field you are grouping by as a key (using setkeyv(DT, "your_key_field"), etc.). Also, I can't speak definitively on the topic, but generally I think you will get better performance from using native data.table functions / operations within your data.table object than you would when using other packages' functions, like plyr::count for example. Below, I made a few data.table objects - the first is identical to your example; the second adds a column Year (instead of calculating format(Date,"%Y") at the time of function execution), but sets Date as the key; and the third is the same as the second, except that it uses Year as the key. I also made a few functions (for benchmarking convenience) that do the grouping in different ways.
library(data.table)
library(plyr) # for 'count' function
library(microbenchmark)
##
dates <- seq.Date(
  from=as.Date("2000-01-01"),
  to=as.Date("2012-12-31"),
  by="day")
##
set.seed(123)
sampleDate <- sample(
  dates,
  1e06,
  replace=TRUE)
##
DT.dt <- data.table(
  Date=sampleDate,
  incident=1)
##
DT.dt2 <- copy(DT.dt)
DT.dt2[,Year:=format(Date,"%Y")]
setkeyv(DT.dt2,"Date")
##
DT.dt3 <- copy(DT.dt2)
setkeyv(DT.dt3,"Year")
> head(DT.dt,3)
         Date incident
1: 2003-09-27        1
2: 2010-04-01        1
3: 2005-04-26        1
> head(DT.dt2,3)
         Date incident Year
1: 2000-01-01        1 2000
2: 2000-01-01        1 2000
3: 2000-01-01        1 2000
> head(DT.dt3,3)
         Date incident Year
1: 2000-01-01        1 2000
2: 2000-01-01        1 2000
3: 2000-01-01        1 2000
## your original method
f1 <- function(dt)
{
dt[,count(format(Date,"%Y"))]
}
## your method - using 'Year' column
f1.2 <- function(dt)
{
dt[,count(Year)]
}
## use 'Date' column; '.N' and
## 'by=' instead of 'count'
f2 <- function(dt)
{
dt[,.N,by=format(Date,"%Y")]
}
## use 'Year' and '.N','by='
f3 <- function(dt)
{
dt[,.N,by=Year]
}
##
Res <- microbenchmark(
  f1(DT.dt),
  f1.2(DT.dt2),
  f1.2(DT.dt3),
  f2(DT.dt2),
  f3(DT.dt3))
##
> Res
Unit: milliseconds
         expr        min         lq     median         uq      max neval
    f1(DT.dt) 478.941767 515.144253 557.428159 585.579862 706.8724   100
 f1.2(DT.dt2)  98.722062 115.588034 126.332104 137.792116 223.4967   100
 f1.2(DT.dt3)  97.475673 118.134788 125.836817 136.136156 238.2697   100
   f2(DT.dt2) 352.767219 373.337958 387.759996 429.301164 542.1674   100
   f3(DT.dt3)   7.912803   8.441159   8.736887   9.685267  76.9629   100
Observations:

- Grouping by the precalculated field Year instead of calculating format(Date,"%Y") at execution time was a definite improvement - for both the count and .N approaches. You can see this by comparing the f1() and f2() times to the f1.2() times.

- The count approach seemed to be a little slower than the .N & by= approach (f1() compared to f2()).

- Combining the precalculated field Year with the native data.table grouping (.N & by=), f3() was considerably faster than the other four timings.

There are some pretty experienced data.table users on SO, certainly more so than myself, so there may be an even faster way to do this. All else aside, though, it's definitely a good idea to set a key on your data.table; and it certainly seems like you would be much better off precalculating a field like Year than doing so "on the fly"; you can always delete it afterwards if you don't need it by using DT.dt[,Year:=NULL].
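Putting that together, a minimal sketch of the whole pattern - precalculate the grouping field, key on it, aggregate with .N, then drop the helper column - using the DT.dt object from above:

DT.dt[,Year:=format(Date,"%Y")]   # precalculate the grouping field
setkeyv(DT.dt,"Year")             # key on it for fast grouping
counts <- DT.dt[,.N,by=Year]      # count rows per year natively
DT.dt[,Year:=NULL]                # remove the helper column when done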
Also, you said you are trying to count the number of incidents per year - and since your example data had incident = 1 for all rows, counting was the same as summing. But assuming your real data has different values of incident, you could do something like this:
> DT.dt3[,list(Incidents=sum(incident)),by=Year]
    Year Incidents
 1: 2000     77214
 2: 2001     77385
 3: 2002     77080
 4: 2003     76609
 5: 2004     77197
 6: 2005     76994
 7: 2006     76560
 8: 2007     76904
 9: 2008     76786
10: 2009     76765
11: 2010     76675
12: 2011     76868
13: 2012     76963
(where I called setkeyv(DT.dt3,cols="Year") above).
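And since you mentioned needing other similar, but more complex, counts: the same .N / by= pattern extends to multiple grouping expressions. For example (just a sketch - the Month expression here is computed on the fly rather than being a column of DT.dt3):

# hypothetical example: count incidents per year and month
DT.dt3[,.N,by=list(Year, Month=format(Date,"%m"))]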
Upvotes: 4