Issue with timestamp parsing in read_csv_arrow from arrow package in R

Question

Suppose there is a csv file named ta_sample.csv as under:

"BILL_DT","AMOUNT"
"2015-07-27T18:30:00Z",16000
"2015-07-07T18:30:00Z",6110
"2015-07-26T18:30:00Z",250
"2015-07-22T18:30:00Z",1000
"2015-07-06T18:30:00Z",2640000

Reading the above using read_csv_arrow and customizing the column types which is always needed in actual production data:

library(arrow)
read_csv_arrow(
  "ta_sample.csv",
  col_names = c("BILL_DT", "AMOUNT"),
  col_types = "td",
  skip = 1,
  timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))

the result is as under:

# A tibble: 5 x 2
  BILL_DT              AMOUNT
                  
1 2015-07-28 00:00:00   16000
2 2015-07-08 00:00:00    6110
3 2015-07-27 00:00:00     250
4 2015-07-23 00:00:00    1000
5 2015-07-07 00:00:00 2640000

The issue here is that the dates are increased by one day and the time disappears. It is worth mentioning here that data.table::fread() as well as readr::read_csv() read it properly, eg,

library(readr)
read_csv("ta_sample.csv")
# A tibble: 5 x 2
  BILL_DT              AMOUNT
                  
1 2015-07-27 18:30:00   16000
2 2015-07-07 18:30:00    6110
3 2015-07-26 18:30:00     250
4 2015-07-22 18:30:00    1000
5 2015-07-06 18:30:00 2640000

Parsing example values in BILL_DT column with strptime also work perfectly as under:

strptime(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "%Y-%m-%dT%H:%M:%SZ")
[1] "2015-07-27 18:30:00 IST" "2015-07-07 18:30:00 IST"

What parameters in read_csv_arrow need to be adjusted to get results identical to that given by readr::read_csv() ?

Jonathan Keane · Accepted Answer

There are a few things going on here, but they all relate to timezones + how they are interpreted by various parts of R + Arrow + other packages.

When Arrow reads in timestamps, it treats the values as if they were UTC. Arrow does not yet have the ability to specify alternative timezones when parsing[1], so stores these values as timezoneless (and assumes UTC). Though in this case, since the timestamps you have are UTC (according to ISO_8601, the Z at the end means UTC) they are stored correctly in Arrow as timezoneless UTC timestamps. The values of the timestamps are the same (that is, they represent the same time in UTC), the difference is in how they are displayed: are they displayed as the time in UTC or are they displayed in the local timezone.

When the timestamps are converted into R, the timezonelessness is preserved:

> from_arrow <- read_csv_arrow(
+   "ta_sample.csv",
+   col_names = c("BILL_DT", "AMOUNT"),
+   col_types = "td",
+   skip = 1,
+   timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
> 
> attr(from_arrow$BILL_DT, "tzone")
NULL

R defaults to displaying timestamps without a tzone attribute in the local timezone (for me it's currently CDT, for you it looks like it's IST). And, note that timestamps with an explicit timezone are displayed in that timezone.

> from_arrow$BILL_DT
[1] "2015-07-27 13:30:00 CDT" "2015-07-07 13:30:00 CDT"
[3] "2015-07-26 13:30:00 CDT" "2015-07-22 13:30:00 CDT"
[5] "2015-07-06 13:30:00 CDT"

If you would like to display the UTC timestamps, you can do a few things:

Explicitly set the tzone attribute (or you could use lubridate::with_tz() for the same operation):

> attr(from_arrow$BILL_DT, "tzone") <- "UTC"
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"

You can set the timezone in your R session so that when R goes to display the time it uses UTC (Note: the tzone attribute is still unset here, but the display is UTC because the session timezone is set to UTC)

> Sys.setenv(TZ="UTC")
> from_arrow <- read_csv_arrow(
 3.   "ta_sample.csv",
 4.   col_names = c("BILL_DT", "AMOUNT"),
 5.   col_types = "td",
 6.   skip = 1,
 7.   timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
> attr(from_arrow$BILL_DT, "tzone")
NULL

You could read the data into an Arrow table, and cast the timestamp to have an explicit timezone in Arrow before pulling the data into R with collect(). This csv -> Arrow table -> data.frame is what happens under the hood, so there are no additional conversions going on here (other than the cast). And it can be useful + more efficient to do operations on the Arrow table if you have other transformations you are applying, though it is more code than the first two.

> library(arrow)
> library(dplyr)
> tab <- read_csv_arrow(
+     "ta_sample.csv",
+     col_names = c("BILL_DT", "AMOUNT"),
+     col_types = "td",
+     skip = 1,
+     as_data_frame = FALSE)
> 
> tab_df <- tab %>% 
+   mutate(BILL_DT_cast = cast(BILL_DT, timestamp(unit = "s", timezone = "UTC"))) %>%
+    collect()
> attr(tab_df$BILL_DT, "tzone")
NULL
> attr(tab_df$BILL_DT_cast, "tzone")
[1] "UTC"
> tab_df
# A tibble: 5 × 3
  BILL_DT              AMOUNT BILL_DT_cast       
                                
1 2015-07-27 13:30:00   16000 2015-07-27 18:30:00
2 2015-07-07 13:30:00    6110 2015-07-07 18:30:00
3 2015-07-26 13:30:00     250 2015-07-26 18:30:00
4 2015-07-22 13:30:00    1000 2015-07-22 18:30:00
5 2015-07-06 13:30:00 2640000 2015-07-06 18:30:00

This is also made a bit more confusing because base R's strptime() doesn't parse timezones (which is why you're seeing the same clock time but with IST in your example above). lubridate's[2] parsing functions do respect this, and you can see the difference here:

> lubridate::parse_date_time(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "YmdHMS")
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"

[1] Though we have two issues related to adding this functionality https://issues.apache.org/jira/browse/ARROW-12820 and https://issues.apache.org/jira/browse/ARROW-13348

[2] And, lubridate's docs even mention this:

ISO8601 signed offset in hours and minutes from UTC. For example -0800, -08:00 or -08, all represent 8 hours behind UTC. This format also matches the Z (Zulu) UTC indicator. Because base::strptime() doesn't fully support ISO8601 this format is implemented as an union of 4 orders: Ou (Z), Oz (-0800), OO (-08:00) and Oo (-08). You can use these four orders as any other but it is rarely necessary. parse_date_time2() and fast_strptime() support all of the timezone formats. https://lubridate.tidyverse.org/reference/parse_date_time.html

Issue with timestamp parsing in read_csv_arrow from arrow package in R

Answers (1)

Related Questions