Dean MacGregor
Dean MacGregor

Reputation: 18661

How to avoid timezone offset when using read_csv_arrow

Let's say I have a csv file. For example, this one, https://www.misoenergy.org/planning/generator-interconnection/GI_Queue/gi-interactive-queue/#

If I do

miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()


# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date`        `Withdrawn Date`    `Done Date`         `Appl In Service ~` `Transmission ~` County State
<chr>       <chr>            <dttm>              <dttm>              <dttm>              <dttm>              <chr>            <chr>  <chr>
1 E002        Done             2013-09-12 20:00:00 NA                  2003-12-12 19:00:00 NA                  Entergy          Point~ LA   
2 E291        Done             2012-05-14 20:00:00 NA                  2013-10-21 20:00:00 2015-12-31 19:00:00 Entergy          NA     TX   
3 G001        Withdrawn        1995-11-07 19:00:00 NA                  NA                  NA                  American Transm~ Brown~ WI   
4 G002        Done             1998-11-30 19:00:00 NA                  NA                  NA                  LG&E and KU Ser~ Trimb~ KY 

It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).

I can do Sys.setenv(TZ="GMT") before I load the file and then that avoids the offset issue.

Sys.setenv(TZ="GMT")
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()

# A tibble: 3,343 x 24
  `Project #` `Request Status` `Queue Date`        `Withdrawn Date`    `Done Date`         `Appl In Service ~` `Transmission ~` County State
  <chr>       <chr>            <dttm>              <dttm>              <dttm>              <dttm>              <chr>            <chr>  <chr>
  1 E002        Done             2013-09-13 00:00:00 NA                  2003-12-13 00:00:00 NA                  Entergy          Point~ LA   
  2 E291        Done             2012-05-15 00:00:00 NA                  2013-10-22 00:00:00 2016-01-01 00:00:00 Entergy          NA     TX   
  3 G001        Withdrawn        1995-11-08 00:00:00 NA                  NA                  NA                  American Transm~ Brown~ WI   
  4 G002        Done             1998-12-01 00:00:00 NA                  NA                  NA                  LG&E and KU Ser~ Trimb~ KY   

While setting my session tz to GMT isn't really too onerous, I'm wondering if there's a way to have it either assume the file is the same as my local time zone and just keep it that way or if it wants to assume it's GMT in the file then just keep it in GMT regardless of my local timezone.

Upvotes: 0

Views: 178

Answers (1)

Will Jones
Will Jones

Reputation: 223

It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).

Actually, the timezone conversion you are seeing just happens when you print. You can see this if you save the data frame to a variable and print it before and after you change your current timezone:

miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
df <- miso_queue %>% collect()
Sys.setenv(TZ="US/Pacific")
test[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`       
# <dttm>             
#   1 2013-09-12 17:00:00
# 2 2012-05-14 17:00:00
# 3 1995-11-07 16:00:00
# 4 1998-11-30 16:00:00
# 5 1998-11-30 16:00:00
# 6 1998-11-30 16:00:00
# 7 1999-02-14 16:00:00
# 8 1999-02-14 16:00:00
# 9 1999-07-29 17:00:00
# 10 1999-08-12 17:00:00
# # … with 3,333 more rows
Sys.setenv(TZ="GMT")
test[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`       
# <dttm>             
#   1 2013-09-13 00:00:00
# 2 2012-05-15 00:00:00
# 3 1995-11-08 00:00:00
# 4 1998-12-01 00:00:00
# 5 1998-12-01 00:00:00
# 6 1998-12-01 00:00:00
# 7 1999-02-15 00:00:00
# 8 1999-02-15 00:00:00
# 9 1999-07-30 00:00:00
# 10 1999-08-13 00:00:00
# # … with 3,333 more rows

However, in the example you showed there is no time data, so you might be better off reading that column as a date instead of a timestamp. Unfortunately right now I think Arrow only lets you parse as a date right now if you provide the schema for the whole table. One alternative would be to parse the date columns after reading.

Upvotes: 4

Related Questions