R - Transforming columns imported with fread() to date with various methods results in a FALSE evaluation of all.equal()

Question

I have a data set as .csv file that I read in using the fread() function. This works perferctly fine. So far, date columns in the raw input file are stored as numeric in the format YYYYMMDD, such as f.ex. 20200131 or 20210308.

Up until now, I used data.table syntax to transform the type of data for a given column that was not "guessed" correctly by fread() like this:

    library(data.table)
    library(lubridate)
    library(anytime)

    import_by_fread <- function(fp) {
      fread(
        file             = fp,
        na.strings       = c("NULL", "NA"),
        encoding         = "UTF-8",
        header           = TRUE,
        keepLeadingZeros = TRUE,
        showProgress     = TRUE, 
        data.table       = TRUE
      )
   }

data.dt <- import_by_fread(file = "some_file_path")

data.dt[, date_col_1 := as.Date(as.character(date_col_1, format = "%Y%m%d")]

This transformed the numeric data in the supposedly date column from numeric to character and then subsequently date.

Now that I intended to speed things up by using the proposed anytime pkg in data.table's vignette, I wanted to check whether the resulting input was the same, and to my surprise the methods I used to check for equality of different date transformation methods sayed they weren't. A "manual" way to check this was to extract the particular column from the data.table and compare these vectors between date conversion methods and count the FALSE values from this vector comparison, but there is none. I reckon thus, the two date columns are not equal.

data.dt_1 <- import_by_fread(file = "some_file_path")
data.dt_2 <- import_by_fread(file = "same_fp")

identical(data.dt_1, data.dt_2) # TRUE
all.equal(data.dt_1, data.dt_2) # TRUE

data.dt_1[, date_col_1 := as.Date(as.character(date_col_1), format = "%Y%m%d"))]
data.dt_2[, date_col_1 := anydate(date_col_1)]

identical(data.dt_1[, date_col_1], data.dt_2[, date_col_1]) # FALSE
all.equal(data.dt_1[, date_col_1], data.dt_2[, date_col_1]) # FALSE

However, when I do this, all seems to have worked smoothly...

 # test for euquality of date transformation of all date columns
 lapply(as.vector(data.dt_1[, c(3,5,7,12,14,16:17,32)]), is.Date) # TRUE
 lapply(as.vector(data.dt_2[, c(3,5,7,12,14,16:17,32)]), is.Date) # TRUE

Is this somehow related to how dates are stored internally? How can I make sure that the >1m rows are equal after applying either date-type transformation methods? Thanks

Rui Barradas · Accepted Answer

The difference comes from the fact that anytime::anydate sets an attribute tzone, unlike as.Date.

library(anytime)
library(lubridate)

x <- as.character(c(20200131, 20210308))
d1 <- as.Date(x, "%Y%m%d")
d2 <- anydate(x)

attributes(d1)
#$class
#[1] "Date"

attributes(d2)
#$class
#[1] "Date"
#
#$tzone
#[1] "Europe/Lisbon"

identical(d1, d2)
#[1] FALSE

all.equal(d1, d2)
#[1] "Attributes: < Length mismatch: comparison on first 1 components >"

d1 - d2
#Time differences in days
#[1] 0 0

d3 <- ymd(c(20200131, 20210308))
identical(d1, d3)
#[1] TRUE

Edit

Here are performance tests for different conversion functions and increasing vector sizes.

library(microbenchmark)
library(ggplot2)

fun <- function(n){
  out <- lapply(seq.int(n), function(k){
    y <- rep(c(20200131, 20210308), 10^k)
    mb <- microbenchmark(
      base = as.Date(as.character(y), "%Y%m%d"),
      anytime = anydate(y),
      anytime2 = anydate(y, calcUnique = TRUE),
      lubridate = ymd(y)
    )
    agg <- aggregate(time ~ expr, mb, median)
    agg$n <- k
    agg
  })
  do.call(rbind, out)
}

res <- fun(4)

ggplot(res, aes(n, time, colour = expr)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(trans = "log10")

R - Transforming columns imported with fread() to date with various methods results in a FALSE evaluation of all.equal()

Answers (2)

Edit

Related Questions