user3357059
user3357059

Reputation: 1192

Reading dates columns using the fread function of the data.table package

I am reading some .csv files that have non-standard format date format like m/d/Y and using the fread function the dates are read as character like m/d/y losing the four digit characters for the year. I tried using the colClasses format to read the date columns as dates but they are still read as character with a two character year and with a warning that the dates are not in a standard unambiguous format, see example below.

library(data.table)
library(dplyr)

set.seed(1)

DT <- data.table(orig_beg_date = sample(seq.Date(from = as.Date("1900-01-01"),
                                             to = Sys.Date(), by = "month"),
                                    5)) %>% 
mutate(beg_date = format(orig_beg_date, "%m/%d/%y"))

DT 
   orig_beg_date beg_date
          <Date>   <char>
1:    1984-09-01 09/01/84
2:    1956-07-01 07/01/56
3:    1910-09-01 09/01/10
4:    1977-06-01 06/01/77
5:    1939-03-01 03/01/39

When I format the beg_date column to date format I get the following:

DT$beg_date %>% as.Date(., format = "%m/%d/%y")
[1] "1984-09-01" "2056-07-01" "2010-09-01" "1977-06-01" "2039-03-01"

The year of the second date becomes 2056 from 1956. The third date also has the issue, it becomes 2010 from 1910. The csv file would show these two dates as 7/1/1956 and 9/1/2010. How can I read the dates correctly using the colClasses?

Upvotes: 1

Views: 55

Answers (1)

sonshine
sonshine

Reputation: 591

I'm confused on how the dates are stored in the .csv file, but here are a couple options.

If the dates are stored as mdY with the full year (12-25-1910), then I think reading as a character and using lubridate is a good option.

set.seed(1)

DT <- data.table::data.table(
    orig_beg_date = sample(
        seq.Date(
            from = as.Date("1900-01-01"),
            to = Sys.Date(), by = "month"
        ),
        5
    )
)

# format to mdY
DT_mdY <- DT |>
    dplyr::mutate(beg_date = format(orig_beg_date, "%m/%d/%Y")) |>
    dplyr::select(mdy = beg_date)

# write
data.table::fwrite(DT_mdY, "mdY.csv")

# read in and parse date
x <- data.table::fread("./mdY.csv", colClasses = c(mdy = "character"))
x[, mdy := lubridate::mdy(mdy)]

print(x)
          mdy
       <Date>
1: 1984-09-01
2: 1956-07-01
3: 1910-09-01
4: 1977-06-01
5: 1939-03-01

If the dates are mdy with a truncated year (12-25-10), I'm able to replicate your results since the year part is missing information. I'm not sure how it would know the correct year other than guessing.

# format
DT_mdy <- DT |>
    dplyr::mutate(beg_date = format(orig_beg_date, "%m/%d/%y")) |>
    dplyr::select(mdy = beg_date)

# write
data.table::fwrite(DT_mdy, "mdy.csv")

x <- data.table::fread("./mdy.csv", colClasses = c(mdy = "character"))
x[, mdy := lubridate::mdy(mdy)]

print(x)
          mdy
       <Date>
1: 1984-09-01
2: 2056-07-01
3: 2010-09-01
4: 1977-06-01
5: 2039-03-01

If you knew the maximum year, then you could alter the year part after the fact.

f <- function(x) {
    lubridate::year(x) <- lubridate::year(x) - 100
    return(x)
}

x[lubridate::year(mdy) >= 2000, mdy := f(mdy)]

print(x)
          mdy
       <Date>
1: 1984-09-01
2: 1956-07-01
3: 1910-09-01
4: 1977-06-01
5: 1939-03-01

However, if you have a mix of years that are represented the same (12-24-10 meaning 1910-12-24 and 12-25-10 meaning 2010-12-25), I think you'd be out of luck without some prior knowledge of which is which.

From the error message you're referring to, it seems likely that the dates are not all stored the same format. In that case, you could use lubridate::parse_date_time() after reading in as a character if you know the formats in advance.

date <- c("1910-12-24", "12-24-1910")
lubridate::parse_date_time(date, orders = c("ymd", "mdy"))
[1] "1910-12-24 UTC" "1910-12-24 UTC"

One final thing you might try is reading a limited number of rows and seeing if it parses correctly. Might help you narrow down any issues.

data.table::fread("./mdY.csv", nrows = 2)

Upvotes: 0

Related Questions