aimbotter21
aimbotter21

Reputation: 331

R - Transforming columns imported with fread() to date with various methods results in a FALSE evaluation of all.equal()

I have a data set as .csv file that I read in using the fread() function. This works perferctly fine. So far, date columns in the raw input file are stored as numeric in the format YYYYMMDD, such as f.ex. 20200131 or 20210308.

Up until now, I used data.table syntax to transform the type of data for a given column that was not "guessed" correctly by fread() like this:

    library(data.table)
    library(lubridate)
    library(anytime)

    import_by_fread <- function(fp) {
      fread(
        file             = fp,
        na.strings       = c("NULL", "NA"),
        encoding         = "UTF-8",
        header           = TRUE,
        keepLeadingZeros = TRUE,
        showProgress     = TRUE, 
        data.table       = TRUE
      )
   }

data.dt <- import_by_fread(file = "some_file_path")

data.dt[, date_col_1 := as.Date(as.character(date_col_1, format = "%Y%m%d")]

This transformed the numeric data in the supposedly date column from numeric to character and then subsequently date.

Now that I intended to speed things up by using the proposed anytime pkg in data.table's vignette, I wanted to check whether the resulting input was the same, and to my surprise the methods I used to check for equality of different date transformation methods sayed they weren't. A "manual" way to check this was to extract the particular column from the data.table and compare these vectors between date conversion methods and count the FALSE values from this vector comparison, but there is none. I reckon thus, the two date columns are not equal.

data.dt_1 <- import_by_fread(file = "some_file_path")
data.dt_2 <- import_by_fread(file = "same_fp")

identical(data.dt_1, data.dt_2) # TRUE
all.equal(data.dt_1, data.dt_2) # TRUE

data.dt_1[, date_col_1 := as.Date(as.character(date_col_1), format = "%Y%m%d"))]
data.dt_2[, date_col_1 := anydate(date_col_1)]

identical(data.dt_1[, date_col_1], data.dt_2[, date_col_1]) # FALSE
all.equal(data.dt_1[, date_col_1], data.dt_2[, date_col_1]) # FALSE

However, when I do this, all seems to have worked smoothly...

 # test for euquality of date transformation of all date columns
 lapply(as.vector(data.dt_1[, c(3,5,7,12,14,16:17,32)]), is.Date) # TRUE
 lapply(as.vector(data.dt_2[, c(3,5,7,12,14,16:17,32)]), is.Date) # TRUE

Is this somehow related to how dates are stored internally? How can I make sure that the >1m rows are equal after applying either date-type transformation methods? Thanks

Upvotes: 0

Views: 378

Answers (2)

Shorthand
Shorthand

Reputation: 176

Updating the code from Rui B. to add calcUnique::calcUnique() ... which is a generalized wrapper function that came out of working with Dirk Eddelbuettel on adding calcUnique = TRUE to anytime() ... here are the benchmarks of all six possibilities ...

library(microbenchmark)
library(ggplot2)
library(lubridate)
library(anytime)
library(calcUnique)

fun <- function(n){
    out <- lapply(seq.int(n), function(k){
        y <- rep(c(20200131, 20210308), 10^k)
        mb <- microbenchmark(
            base = as.Date(as.character(y), "%Y%m%d"),
            base_u = calcUnique(y, function(z){as.Date(as.character(z), "%Y%m%d")}),
            anytime = anydate(y),
            anytime_u = anydate(y, calcUnique = TRUE),
            lubridate = ymd(y),
            lubridate_u = calcUnique(y, ymd)
        )
        agg <- aggregate(time ~ expr, mb, median)
        agg$n <- k
        agg
    })
    do.call(rbind, out)
}

res <- fun(4)

ggplot(res, aes(n, time, colour = expr)) +
    geom_line(alpha = .75) +
    geom_point(alpha = .75) +
    scale_y_continuous(trans = "log10") +
    scale_colour_brewer(palette = "Dark2") +
    theme_minimal() +
    labs(title = "Text to Date Performance of Different Packages") +
    theme(plot.title.position = "plot")

As you can see, calcUnique(x, as.Date) performs almost identically to anytime::anydate(x, calcUnique = TRUE) ... lubridate::ymd() seems to scale quite well, but has quite a bit of overhead.

text to date performance

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76641

The difference comes from the fact that anytime::anydate sets an attribute tzone, unlike as.Date.

library(anytime)
library(lubridate)

x <- as.character(c(20200131, 20210308))
d1 <- as.Date(x, "%Y%m%d")
d2 <- anydate(x)

attributes(d1)
#$class
#[1] "Date"

attributes(d2)
#$class
#[1] "Date"
#
#$tzone
#[1] "Europe/Lisbon"

identical(d1, d2)
#[1] FALSE

all.equal(d1, d2)
#[1] "Attributes: < Length mismatch: comparison on first 1 components >"

d1 - d2
#Time differences in days
#[1] 0 0

d3 <- ymd(c(20200131, 20210308))
identical(d1, d3)
#[1] TRUE

Edit

Here are performance tests for different conversion functions and increasing vector sizes.

library(microbenchmark)
library(ggplot2)

fun <- function(n){
  out <- lapply(seq.int(n), function(k){
    y <- rep(c(20200131, 20210308), 10^k)
    mb <- microbenchmark(
      base = as.Date(as.character(y), "%Y%m%d"),
      anytime = anydate(y),
      anytime2 = anydate(y, calcUnique = TRUE),
      lubridate = ymd(y)
    )
    agg <- aggregate(time ~ expr, mb, median)
    agg$n <- k
    agg
  })
  do.call(rbind, out)
}

res <- fun(4)

ggplot(res, aes(n, time, colour = expr)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(trans = "log10")

enter image description here

Upvotes: 2

Related Questions