Reputation: 16089
I have a large data table (1 billion rows × 50 columns), similar to flights from library(nycflights13), where multiple columns can be combined to form a date-time. The code I am currently using to create this date-time column is:
library(data.table)
library(nycflights13)
library(fasttime)
flights <- as.data.table(flights)
flights[, DepDateTime := fastPOSIXct(
  paste0(year, "-",
         formatC(month, width = 2, format = "d", flag = "0"), "-",
         formatC(day, width = 2, format = "d", flag = "0"), " ",
         # replace e.g. 903 with '09:03:00'
         gsub("([0-9]{2})([0-9]{2})", "\\1:\\2:00",
              formatC(dep_time, width = 4, format = "d", flag = "0"))))]
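To make the transformation above concrete, the 903 example from the comment goes through the formatC padding and the gsub substitution like this:
# 903 is padded to "0903", then rewritten to "09:03:00",
# so fastPOSIXct ultimately receives e.g. "2013-01-01 09:03:00"
gsub("([0-9]{2})([0-9]{2})", "\\1:\\2:00",
     formatC(903L, width = 4, format = "d", flag = "0"))
# [1] "09:03:00"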
For the flights data, this takes around 0.6 s. Is there any way to improve this performance? I am interested in timing primarily; memory usage is a secondary concern.
Here is a candidate data table:
flights.big <- data.table(
  year   = sample(1980:2015, size = 1e9, replace = TRUE),
  month  = sample(1:12, size = 1e9, replace = TRUE),
  day    = sample(1:28, size = 1e9, replace = TRUE),
  hour   = sample(1:12, size = 1e9, replace = TRUE),
  minute = sample(0:59, size = 1e9, replace = TRUE)
)
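Note that flights.big has hour and minute columns rather than a dep_time; to run the expression above on it, one option (my assumption about the intended HHMM encoding) is to derive dep_time first:
# Assumption: encode dep_time in the same HHMM integer form as
# flights$dep_time, so hour 9 / minute 3 becomes 903
flights.big[, dep_time := hour * 100L + minute]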
Upvotes: 2
Views: 335
Reputation: 16089
A considerable speed increase was achieved by using a join and sprintf inside a function (create_DepDateTime, wrapped as create_fn in the benchmark below): the date-time is computed only for the unique (year, month, day, dep_time) combinations and then joined back onto the full table, so the string formatting and parsing run on far fewer rows. The increase is more modest for the smaller dataset:
library(data.table)
library(nycflights13)
library(fasttime)
library(microbenchmark)
library(ggplot2) # for autoplot
create_DepDateTime <- function(DT){
  setkey(DT, year, month, day, dep_time)
  # format and parse only the unique date/time combinations
  unique_dates <- unique(DT[, list(year, month, day, dep_time)])
  unique_dates[, DepDateTime := fastPOSIXct(
    sprintf("%d-%02d-%02d %s", year, month, day,
            # replace e.g. "0903" with "09:03:00"
            sub("([0-9]{2})([0-9]{2})", "\\1:\\2:00",
                sprintf("%04d", dep_time), perl = TRUE)),
    tz = "GMT")]
  # join the result back onto the full (keyed) table
  DT[unique_dates]
}
flights <- as.data.table(flights)
BENCHMARK <- function(){
  flights[, DepDateTime := fastPOSIXct(
    paste0(year, "-",
           formatC(month, width = 2, format = "d", flag = "0"), "-",
           formatC(day, width = 2, format = "d", flag = "0"), " ",
           # replace e.g. 903 with '09:03:00'
           gsub("([0-9]{2})([0-9]{2})", "\\1:\\2:00",
                formatC(dep_time, width = 4, format = "d", flag = "0"))))]
}
NGaffney_lubridate <- function(){
  flights[, DepDateTime := lubridate::ymd_hm(
    paste(year, month, day,
          stringr::str_pad(dep_time, width = 4, side = "left", pad = "0"),
          sep = "-"))]
}
create_fn <- function(){
flights <- create_DepDateTime(flights)
}
autoplot(
microbenchmark(
BENCHMARK(),
NGaffney_lubridate(),
create_fn(),
times=50L
)
)
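As a rough usage sketch on the large table (assuming flights.big has been given an integer dep_time column in HHMM form, as in the question), the join-based function can be timed directly:
# Sketch: run the join-based approach on the 1e9-row table and time it;
# fastPOSIXct/sprintf only touch the unique date/time combinations
system.time(flights.big <- create_DepDateTime(flights.big))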
Upvotes: 0
Reputation: 1532
I used lubridate and stringr to get around a 25% performance boost on the flights data. Unfortunately I'm not currently on a computer that can handle a data set as big as your full one, so hopefully it scales.
library(data.table)
library(nycflights13)
library(fasttime)
library(microbenchmark)
library(lubridate)
library(stringr)
flights <- as.data.table(flights)
op1 <- microbenchmark(
  flights[, DepDateTime := fastPOSIXct(
    paste0(year, "-",
           formatC(month, width = 2, format = "d", flag = "0"), "-",
           formatC(day, width = 2, format = "d", flag = "0"), " ",
           # replace e.g. 903 with '09:03:00'
           gsub("([0-9]{2})([0-9]{2})", "\\1:\\2:00",
                formatC(dep_time, width = 4, format = "d", flag = "0"))))],
  times = 50L)
op2 <- microbenchmark(
  flights[, DepDateTime := ymd_hm(
    paste(year, month, day,
          str_pad(dep_time, width = 4, side = "left", pad = "0"),
          sep = "-"))],
  times = 50L)
The benchmarks on my computer are:
> op1
       min       lq     mean   median       uq      max neval
  3.385542 3.526347 3.739545 3.679273 3.855418 4.594314    50
> op2
       min       lq     mean   median       uq      max neval
  2.536882 2.589711 2.733829 2.715038 2.835111 3.194575    50
Upvotes: 2