skan
skan

Reputation: 7720

R: data.table. How to save dates properly with fwrite?

I have a dataset. I can choose to load it on R from a Stata file or from a SPSS file.
In both cases it's loaded properly with the haven package. The dates are recognized properly.

But when I save it to disk with data.table's fwrite function.
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)

I have a problem, the dates dissapear and are converted to different numbers. For example, the date 1967-08-06 is saved in the csv file as -879

I've also tried playing with fwrite options, such as quote=FALSE, with no success.

I've uploaded a small sample of the files, the spss, the stata and the saved csv.

and this is the code, in order to do things easier for you.

library(haven)
library(data.table)
ppp <- read_sav("pspss.sav") # choose one of these two.
ppp <- read_dta("pstata.dta")  # choose one of these two.
fwrite(ppp, "ppp.csv",  sep=",", col.names = TRUE) 

The real whole table has more than one thousand variables and one million individuals. That's why I would like to use a fast way to do things.

http://www73.zippyshare.com/v/OwzwbyQq/file.html

This is for @ArtificialBreeze:

> head(my)
# A tibble: 6 x 9
  ID_2006_2011              TIS FECHA_NAC_2006    año2006 Edad_31_12_2006 SEXO_2006
         <dbl>            <chr>         <date>     <date>           <dbl>     <chr>
1  1.60701e+11 BBNR670806504015     1967-08-06 2006-12-31              39         M
2  1.60701e+11 BCBD580954916014     1958-09-14 2006-12-31              48         F
3  1.60701e+11 BCBL451245916015     1945-12-05 2006-12-31              61         F
4  1.60701e+11 BCGR610904916012     1961-09-04 2006-12-31              45         M
5  1.60701e+11 BCMR580148916015     1958-01-08 2006-12-31              48         F
6  1.60701e+11 BCMX530356917018     1953-03-16 2006-12-31              53         F
# ... with 3 more variables: PAIS_NAC_2006 <dbl>, FECHA_ALTA_TIS_2006 <date>,
#   FECHA_ALTA_TIS_2006n <date>

Upvotes: 7

Views: 6371

Answers (5)

cmoreno
cmoreno

Reputation: 41

You need to add the argument: dateTimeAs = "ISO". By adding the argument dateTimeAs = and specifying the appropriate option, you will get dates writting in your csv file with the desired format AND with their respective time zone. This is particularly important in the case of dealing with POSIXct variables, which are time zone dependant. The lack of this argument might affect the timestamps written in the csv file by shifting dates and times according to the difference of hours between time zones. Thus, the date/time variable POSIXct, you will need to add: dateTimeAs = "write.csv" ; unfortunately this option can be slow (https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/fwrite?). Good luck!!!

Upvotes: 0

Matt Dowle
Matt Dowle

Reputation: 59602

Since this question was asked 6 months ago, fwrite has improved and been released to CRAN. I believe it should work as you wanted now; i.e. fast, direct and convenient date formatting. It now has the dateTimeAs argument as follows, copied from fwrite's manual page for v1.10.0 as on CRAN now. As time progresses, please check the latest version of the manual page.

====

dateTimeAs : How Date/IDate, ITime and POSIXct items are written.

  • "ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.

  • "squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).

  • "epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.

  • "write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.

The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. This option applies to vectors of date/time in list column cells, too. A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.

====

Upvotes: 10

klamerka
klamerka

Reputation: 11

These numbers have sense :) It seems that fwrite change data format into "Matlab coding" where origin is "1970-01-01". If you read your data, you can simply change number into date using these code:

my$FECHA_NAC_2006<-as.Date(as.numeric(my$FECHA_NAC_2006),origin="1970-01-01")

For example

as.Date(-879,origin="1970-01-01")

[1] "1967-08-06"

Upvotes: 1

Krimonk
Krimonk

Reputation: 31

I had the same problem, and I just changed the date column to as.character before writing, and then changed it back to as.Date after reading. I don't know how it influences read and write times, but it was a good enough solution for me.

Upvotes: 3

skan
skan

Reputation: 7720

Since it seems there is no simple solution I'm trying to store column classes and change them back again.

I take the original dataset ppp,

areDates <- (sapply(ppp, class) == "Date")

I save it on an file and I can read it next time.

ppp <- fread("ppp.csv", encoding="UTF-8")

And now I change the classes of the newly read dataset back to the original one.

ppp[,names(ppp)[areDates] := lapply(.SD,as.Date),
 .SDcols = areDates ]

Maybe someone can written it better with a for loop and the command set.

 ppp[,lapply(.SD, setattr, "class", "Date") , 
.SDcols = areDates]

It can be also written with positions instead of a vector of TRUE and FALSE

Upvotes: 0

Related Questions