Reputation: 578
Suppose there is a csv
file named ta_sample.csv
as under:
"BILL_DT","AMOUNT"
"2015-07-27T18:30:00Z",16000
"2015-07-07T18:30:00Z",6110
"2015-07-26T18:30:00Z",250
"2015-07-22T18:30:00Z",1000
"2015-07-06T18:30:00Z",2640000
Reading the above using read_csv_arrow
and customizing the column types which is always needed in actual production data:
library(arrow)
read_csv_arrow(
"ta_sample.csv",
col_names = c("BILL_DT", "AMOUNT"),
col_types = "td",
skip = 1,
timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
the result is as under:
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-28 00:00:00 16000
2 2015-07-08 00:00:00 6110
3 2015-07-27 00:00:00 250
4 2015-07-23 00:00:00 1000
5 2015-07-07 00:00:00 2640000
The issue here is that the dates are increased by one day and the time disappears. It is worth mentioning here that data.table::fread()
as well as readr::read_csv()
read it properly, eg,
library(readr)
read_csv("ta_sample.csv")
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-27 18:30:00 16000
2 2015-07-07 18:30:00 6110
3 2015-07-26 18:30:00 250
4 2015-07-22 18:30:00 1000
5 2015-07-06 18:30:00 2640000
Parsing example values in BILL_DT
column with strptime
also work perfectly as under:
strptime(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "%Y-%m-%dT%H:%M:%SZ")
[1] "2015-07-27 18:30:00 IST" "2015-07-07 18:30:00 IST"
What parameters in read_csv_arrow
need to be adjusted to get results identical to that given by readr::read_csv()
?
Upvotes: 0
Views: 830
Reputation: 318
There are a few things going on here, but they all relate to timezones + how they are interpreted by various parts of R + Arrow + other packages.
When Arrow reads in timestamps, it treats the values as if they were UTC. Arrow does not yet have the ability to specify alternative timezones when parsing[1], so stores these values as timezoneless (and assumes UTC). Though in this case, since the timestamps you have are UTC (according to ISO_8601, the Z
at the end means UTC) they are stored correctly in Arrow as timezoneless UTC timestamps. The values of the timestamps are the same (that is, they represent the same time in UTC), the difference is in how they are displayed: are they displayed as the time in UTC or are they displayed in the local timezone.
When the timestamps are converted into R, the timezonelessness is preserved:
> from_arrow <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
>
> attr(from_arrow$BILL_DT, "tzone")
NULL
R defaults to displaying timestamps without a tzone
attribute in the local timezone (for me it's currently CDT, for you it looks like it's IST). And, note that timestamps with an explicit timezone are displayed in that timezone.
> from_arrow$BILL_DT
[1] "2015-07-27 13:30:00 CDT" "2015-07-07 13:30:00 CDT"
[3] "2015-07-26 13:30:00 CDT" "2015-07-22 13:30:00 CDT"
[5] "2015-07-06 13:30:00 CDT"
If you would like to display the UTC timestamps, you can do a few things:
tzone
attribute (or you could use lubridate::with_tz()
for the same operation):> attr(from_arrow$BILL_DT, "tzone") <- "UTC"
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
tzone
attribute is still unset here, but the display is UTC because the session timezone is set to UTC)> Sys.setenv(TZ="UTC")
> from_arrow <- read_csv_arrow(
3. "ta_sample.csv",
4. col_names = c("BILL_DT", "AMOUNT"),
5. col_types = "td",
6. skip = 1,
7. timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
> attr(from_arrow$BILL_DT, "tzone")
NULL
collect()
. This csv -> Arrow table -> data.frame is what happens under the hood, so there are no additional conversions going on here (other than the cast). And it can be useful + more efficient to do operations on the Arrow table if you have other transformations you are applying, though it is more code than the first two.> library(arrow)
> library(dplyr)
> tab <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ as_data_frame = FALSE)
>
> tab_df <- tab %>%
+ mutate(BILL_DT_cast = cast(BILL_DT, timestamp(unit = "s", timezone = "UTC"))) %>%
+ collect()
> attr(tab_df$BILL_DT, "tzone")
NULL
> attr(tab_df$BILL_DT_cast, "tzone")
[1] "UTC"
> tab_df
# A tibble: 5 × 3
BILL_DT AMOUNT BILL_DT_cast
<dttm> <dbl> <dttm>
1 2015-07-27 13:30:00 16000 2015-07-27 18:30:00
2 2015-07-07 13:30:00 6110 2015-07-07 18:30:00
3 2015-07-26 13:30:00 250 2015-07-26 18:30:00
4 2015-07-22 13:30:00 1000 2015-07-22 18:30:00
5 2015-07-06 13:30:00 2640000 2015-07-06 18:30:00
This is also made a bit more confusing because base R's strptime()
doesn't parse timezones (which is why you're seeing the same clock time but with IST in your example above). lubridate's[2] parsing functions do respect this, and you can see the difference here:
> lubridate::parse_date_time(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "YmdHMS")
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[1] Though we have two issues related to adding this functionality https://issues.apache.org/jira/browse/ARROW-12820 and https://issues.apache.org/jira/browse/ARROW-13348
[2] And, lubridate's docs even mention this:
ISO8601 signed offset in hours and minutes from UTC. For example -0800, -08:00 or -08, all represent 8 hours behind UTC. This format also matches the Z (Zulu) UTC indicator. Because base::strptime() doesn't fully support ISO8601 this format is implemented as an union of 4 orders: Ou (Z), Oz (-0800), OO (-08:00) and Oo (-08). You can use these four orders as any other but it is rarely necessary. parse_date_time2() and fast_strptime() support all of the timezone formats. https://lubridate.tidyverse.org/reference/parse_date_time.html
Upvotes: 5