Reputation: 331
I have a data set as .csv file that I read in using the fread() function. This works perferctly fine. So far, date columns in the raw input file are stored as numeric in the format YYYYMMDD
, such as f.ex. 20200131 or 20210308.
Up until now, I used data.table syntax to transform the type of data for a given column that was not "guessed" correctly by fread()
like this:
library(data.table)
library(lubridate)
library(anytime)
import_by_fread <- function(fp) {
fread(
file = fp,
na.strings = c("NULL", "NA"),
encoding = "UTF-8",
header = TRUE,
keepLeadingZeros = TRUE,
showProgress = TRUE,
data.table = TRUE
)
}
data.dt <- import_by_fread(file = "some_file_path")
data.dt[, date_col_1 := as.Date(as.character(date_col_1, format = "%Y%m%d")]
This transformed the numeric data in the supposedly date column from numeric to character and then subsequently date.
Now that I intended to speed things up by using the proposed anytime pkg in data.table's vignette, I wanted to check whether the resulting input was the same, and to my surprise the methods I used to check for equality of different date transformation methods sayed they weren't. A "manual" way to check this was to extract the particular column from the data.table and compare these vectors between date conversion methods and count the FALSE values from this vector comparison, but there is none. I reckon thus, the two date columns are not equal.
data.dt_1 <- import_by_fread(file = "some_file_path")
data.dt_2 <- import_by_fread(file = "same_fp")
identical(data.dt_1, data.dt_2) # TRUE
all.equal(data.dt_1, data.dt_2) # TRUE
data.dt_1[, date_col_1 := as.Date(as.character(date_col_1), format = "%Y%m%d"))]
data.dt_2[, date_col_1 := anydate(date_col_1)]
identical(data.dt_1[, date_col_1], data.dt_2[, date_col_1]) # FALSE
all.equal(data.dt_1[, date_col_1], data.dt_2[, date_col_1]) # FALSE
However, when I do this, all seems to have worked smoothly...
# test for euquality of date transformation of all date columns
lapply(as.vector(data.dt_1[, c(3,5,7,12,14,16:17,32)]), is.Date) # TRUE
lapply(as.vector(data.dt_2[, c(3,5,7,12,14,16:17,32)]), is.Date) # TRUE
Is this somehow related to how dates are stored internally? How can I make sure that the >1m rows are equal after applying either date-type transformation methods? Thanks
Upvotes: 0
Views: 378
Reputation: 176
Updating the code from Rui B. to add calcUnique::calcUnique()
... which is a generalized wrapper function that came out of working with Dirk Eddelbuettel on adding calcUnique = TRUE
to anytime()
... here are the benchmarks of all six possibilities ...
library(microbenchmark)
library(ggplot2)
library(lubridate)
library(anytime)
library(calcUnique)
fun <- function(n){
out <- lapply(seq.int(n), function(k){
y <- rep(c(20200131, 20210308), 10^k)
mb <- microbenchmark(
base = as.Date(as.character(y), "%Y%m%d"),
base_u = calcUnique(y, function(z){as.Date(as.character(z), "%Y%m%d")}),
anytime = anydate(y),
anytime_u = anydate(y, calcUnique = TRUE),
lubridate = ymd(y),
lubridate_u = calcUnique(y, ymd)
)
agg <- aggregate(time ~ expr, mb, median)
agg$n <- k
agg
})
do.call(rbind, out)
}
res <- fun(4)
ggplot(res, aes(n, time, colour = expr)) +
geom_line(alpha = .75) +
geom_point(alpha = .75) +
scale_y_continuous(trans = "log10") +
scale_colour_brewer(palette = "Dark2") +
theme_minimal() +
labs(title = "Text to Date Performance of Different Packages") +
theme(plot.title.position = "plot")
As you can see, calcUnique(x, as.Date)
performs almost identically to anytime::anydate(x, calcUnique = TRUE)
... lubridate::ymd()
seems to scale quite well, but has quite a bit of overhead.
Upvotes: 2
Reputation: 76641
The difference comes from the fact that anytime::anydate
sets an attribute tzone
, unlike as.Date
.
library(anytime)
library(lubridate)
x <- as.character(c(20200131, 20210308))
d1 <- as.Date(x, "%Y%m%d")
d2 <- anydate(x)
attributes(d1)
#$class
#[1] "Date"
attributes(d2)
#$class
#[1] "Date"
#
#$tzone
#[1] "Europe/Lisbon"
identical(d1, d2)
#[1] FALSE
all.equal(d1, d2)
#[1] "Attributes: < Length mismatch: comparison on first 1 components >"
d1 - d2
#Time differences in days
#[1] 0 0
d3 <- ymd(c(20200131, 20210308))
identical(d1, d3)
#[1] TRUE
Here are performance tests for different conversion functions and increasing vector sizes.
library(microbenchmark)
library(ggplot2)
fun <- function(n){
out <- lapply(seq.int(n), function(k){
y <- rep(c(20200131, 20210308), 10^k)
mb <- microbenchmark(
base = as.Date(as.character(y), "%Y%m%d"),
anytime = anydate(y),
anytime2 = anydate(y, calcUnique = TRUE),
lubridate = ymd(y)
)
agg <- aggregate(time ~ expr, mb, median)
agg$n <- k
agg
})
do.call(rbind, out)
}
res <- fun(4)
ggplot(res, aes(n, time, colour = expr)) +
geom_line() +
geom_point() +
scale_y_continuous(trans = "log10")
Upvotes: 2