San

Reputation: 578

DATE-TIME / TIMESTAMP field in parquet file shown as numbers in Parquet file viewers

Suppose I have a data.frame/tibble as follows:

library(readr)
library(arrow)

# testFyl was originally read from a csv file with readr::read_csv()

testFyl <- structure(list(
  BILL_NO = c("36/2015-16", "39/15-16", "771", "254", "731", "610", "200", "23 /2015-16", "21/2015-16", "30/15-16"),
  BILL_DT_TIME = structure(c(1438021800, 1436898600, 1438021800, 1436293800, 1437935400, 1437589800, 1436207400, 1438108200, 1437676200, 1437330600), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  BILL_DT = structure(c(16643, 16630, 16643, 16623, 16642, 16638, 16622, 16644, 16639, 16635), class = "Date")),
  spec = structure(list(cols = list(BILL_NO = structure(list(), class = c("collector_character", "collector")), BILL_DT_TIME = structure(list(format = ""), class = c("collector_datetime", "collector")), BILL_DT = structure(list(format = ""), class = c("collector_date", "collector"))), default = structure(list(), class = c("collector_guess", "collector")), delim = ","), class = "col_spec"), row.names = c(NA, -10L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))

testFyl looks like:

# A tibble: 10 x 3
   BILL_NO     BILL_DT_TIME        BILL_DT   
   <chr>       <dttm>              <date>    
 1 36/2015-16  2015-07-27 18:30:00 2015-07-27
 2 39/15-16    2015-07-14 18:30:00 2015-07-14
 3 771         2015-07-27 18:30:00 2015-07-27
 4 254         2015-07-07 18:30:00 2015-07-07
 5 731         2015-07-26 18:30:00 2015-07-26
 6 610         2015-07-22 18:30:00 2015-07-22
 7 200         2015-07-06 18:30:00 2015-07-06
 8 23 /2015-16 2015-07-28 18:30:00 2015-07-28
 9 21/2015-16  2015-07-23 18:30:00 2015-07-23
10 30/15-16    2015-07-19 18:30:00 2015-07-19

Note that the BILL_DT column has the same dates as the BILL_DT_TIME column, with the time information removed.

Now, I write this table to a Parquet file with

write_parquet(testFyl, "testFyl.parquet")

When I read this Parquet file back into R with

read_parquet("testFyl.parquet")

everything is fine. The table is exactly the same as above, as expected.

However, when I load this parquet file with the following two external parquet file viewing tools, they show dates in formats that I don't understand:

1. ParquetViewer from https://github.com/mukunku/ParquetViewer

ParquetViewer Screenshot

Here, the BILL_DT_TIME column shows numbers which are strange to me.

2. Bigdata File Viewer from https://github.com/Eugene-Mark/bigdata-file-viewer

Bigdata File Viewer Screenshot

Here, both the BILL_DT_TIME and BILL_DT columns show numbers which I don't understand. The BILL_DT numbers are the same ones that appear when the data.frame is dumped with the dput function.

In ParquetViewer the date-time column looks strange while the date column is readable, so it seems that the date-time column could be formatted in some way in R before exporting to Parquet so that it shows up properly in ParquetViewer. Can anyone help me figure out how?

Edit: Meanwhile, I've raised an issue (feature request) on GitHub at https://github.com/mukunku/ParquetViewer/issues/40

Edit2: The developer has graciously updated ParquetViewer to show timestamps in human-intelligible format. So this issue is resolved.

Upvotes: 3

Views: 16850

Answers (1)

crestor

Reputation: 1476

That format is called "timestamp". It's a Unix timestamp expressed in microseconds, i.e. the number of microseconds elapsed since 1970-01-01 00:00:00 UTC.

https://www.epochconverter.com/

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
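
To make the numbers less mysterious, here is a minimal base R sketch that converts such raw values back into readable dates. It assumes the viewer is showing microseconds since the Unix epoch for the timestamp column and days since the epoch for the date column; the variable names (`raw_us`, `raw_days`) are made up for illustration, and the values are taken from the first two rows of the question's table.

```r
# Raw timestamp values as a viewer might display them
# (assumption: microseconds since 1970-01-01 00:00:00 UTC)
raw_us <- c(1438021800000000, 1436898600000000)

# Convert microseconds to seconds, then to POSIXct in UTC
datetimes <- as.POSIXct(raw_us / 1e6, origin = "1970-01-01", tz = "UTC")
print(format(datetimes, "%Y-%m-%d %H:%M:%S"))

# Raw date values (assumption: whole days since 1970-01-01),
# the same numbers dput() shows for the BILL_DT column
raw_days <- c(16643, 16630)
dates <- as.Date(raw_days, origin = "1970-01-01")
print(dates)
```

The results should match rows 1 and 2 of the tibble printed in the question, which suggests the file itself is fine and only the viewer's display is raw.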

Current GUI viewer applications for Parquet files are quite limited.
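
If a viewer cannot format timestamps, one possible workaround is to store an additional text copy of the column before writing. The sketch below is only an illustration under assumptions: `bills` is a small stand-in for `testFyl`, `BILL_DT_TIME_TXT` is a hypothetical column name, and the output file name is arbitrary. The trade-off is that the text column is no longer a real timestamp, so keeping the original column alongside it is probably wise.

```r
# Small stand-in for testFyl, built from values shown in the question
bills <- data.frame(
  BILL_NO = c("36/2015-16", "39/15-16"),
  BILL_DT_TIME = as.POSIXct(c(1438021800, 1436898600),
                            origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical extra column: a plain-text rendering of the timestamp,
# which a viewer would display verbatim as a string
bills$BILL_DT_TIME_TXT <- format(bills$BILL_DT_TIME,
                                 "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Write the file only if the arrow package is installed
if (requireNamespace("arrow", quietly = TRUE)) {
  arrow::write_parquet(bills, file.path(tempdir(), "bills_text.parquet"))
}
```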

Upvotes: 5
