rbatt
rbatt

Reputation: 4807

Robustly parse variably formatted dates in R

I'm trying to convert characters to dates. The dates have heterogeneous formats, and I'd prefer to not have to code each date format separately (and specify which formats belong to which elements, through, e.g., grepl).

Here's my test data:

test <- c(
  "2012-11-11", "12-5-23", "12/5/86", "2015-12-16 1300", 
  "8/6/92 3:00", "11/6/14 4", "10/31/14 52", 
  "06/15/2014 14:37", "2/10/06", "95-06-26", "82-10-03"
)

Desired result:

as.POSIXct(c("2012-11-11 00:00:00 UTC", "2012-05-23 00:00:00 UTC", "1986-12-05 00:00:00 UTC", "2015-12-16 13:00:00 UTC", "1992-08-06 03:00:00 UTC", "2011-06-14 04:00:00 UTC", "2014-10-31 00:52:00 UTC", "2014-06-15 14:37:00 UTC", "2006-02-10 00:00:00 UTC", "1995-06-26 00:00:00 UTC", "1982-10-03 00:00:00 UTC"), tz="UTC")

I realize that some of the weird time in test (like 4, or 53) might be impossible unequivocally parse, but for those I mostly want to make sure that the random time doesn't screw up the date.

Here's my best attempt:

orders <- paste(rep(c("ymd", "mdy", "Ymd"),each=3), c("HM","H","M"))
lubridate::parse_date_time(test, orders=orders, truncated=2)

[1] "2012-11-11 00:00:00 UTC" "0012-05-23 00:00:00 UTC" "1986-12-05 00:00:00 UTC" "2015-12-16 13:00:00 UTC" "0092-08-06 03:00:00 UTC" "2011-06-14 04:00:00 UTC"
[7] "2014-10-31 00:52:00 UTC" "2014-06-15 14:37:00 UTC" "2006-02-10 00:00:00 UTC" "0095-06-26 00:00:00 UTC" "0082-10-03 00:00:00 UTC"

The problem is that it gets the century wrong on the 2-digit years. What surprising is that this works fine:

parse_date_time(test[2], orders=orders[1], truncated=2)
[1] "2012-05-23 UTC"

In ?parse_date_time there is a note:

NOTE: ymd family of functions are based on strptime which currently fails to parse %y-%m formats.

But that shouldn't apply here because 1) I have day and 2) it seems to work when I only have a date with a year w/o century. I think the problem in my specification is in how it interacts with the guessing/ training.

Lubridate is so close to making this tiresome task a lot easier, I think. Is there a way to get lubridate , or any other method, to generically parse the dates in test?

Upvotes: 3

Views: 74

Answers (1)

rbatt
rbatt

Reputation: 4807

This answer only addresses the particular symptom of the wrong century; although that's the only problem I noticed in my test data. I would be happy to see someone else's approach to robustly converting dates; if I've learned anything, it that there are always more surprises in date formats, and that they usually have solutions. But for now, by heavily relying on the linked answer, this is my best:

foo <- function(x, orders, year=1940, ...){
  requireNamespace("lubridate", quietly=TRUE)
  x <- lubridate::parse_date_time(x, orders=orders, ...)
  m <- lubridate::year(x) %% 100
  year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
  x
}

orders <- paste(rep(c("ymd", "mdy", "Ymd"),each=3), c("HM","H","M"))
foo(test, orders, truncated=2)

 [1] "2012-11-11 00:00:00 UTC" "2012-05-23 00:00:00 UTC" "1986-12-05 00:00:00 UTC" "2015-12-16 13:00:00 UTC" "1992-08-06 03:00:00 UTC" "2011-06-14 04:00:00 UTC"
 [7] "2014-10-31 00:52:00 UTC" "2014-06-15 14:37:00 UTC" "2006-02-10 00:00:00 UTC" "1995-06-26 00:00:00 UTC" "1982-10-03 00:00:00 UTC"

the year argument should basically be the earliest year present in your vector of character dates, x.

Upvotes: 1

Related Questions