Reputation: 1071
I have a Dask data frame that has two columns, a date and a value.
I store it like so:
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)
I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (presumably timestamps) instead of dates. For example I get:
1546300800000000 instead of 2019-01-01
1548979200000000 instead of 2019-02-01
Is there a way to tell Dask to store the column as a date? How can I run a SELECT with Apache Drill and get dates back? I tried using CAST in Drill, but it doesn't convert the numbers to dates.
Upvotes: 1
Views: 1430
Reputation: 576
Here's a link to the Drill docs about the TO_TIMESTAMP()
function. (https://drill.apache.org/docs/data-type-conversion/#to_timestamp) I think @mdurant is correct in his approach.
I would try either:
SELECT TO_TIMESTAMP(<date_col>) FROM ...
or
SELECT TO_TIMESTAMP((<date_col> / 1000)) FROM ...
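If you want to check which divisor to use before writing the query, a quick sanity check in Python (a sketch, assuming, as the 16 digits suggest, that the stored values are microseconds since the Unix epoch) would be:
import pandas as pd
# interpret the 16-digit value from the question as microseconds since 1970-01-01
pd.to_datetime(1546300800000000, unit='us')  # Timestamp('2019-01-01 00:00:00')
# so the column needs to be divided down to seconds (by 1000000, not 1000)
# before passing it to TO_TIMESTAMP(), per the linked docs
1546300800000000 / 1000000  # 1546300800.0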
Upvotes: 1
Reputation: 28683
If memory serves, Drill uses the old non-standard INT96 timestamp encoding, which was never properly part of the parquet specification. A parquet timestamp is essentially a UNIX timestamp, stored as an int64 with one of several precisions. Drill must have a function to convert this correctly to its internal format.
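If you want to confirm what type actually ended up in the file, a quick check with pyarrow would look something like this (a sketch; the part-file name is just a guess at what dask produced in /some/folder):
import pyarrow.parquet as pq
# print the parquet schema of one of the part files dask wrote
print(pq.read_schema('/some/folder/part.0.parquet'))
# the date column should show up as an int64-backed timestamp type, not as a date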
I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but might give you the idea:
SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000) FROM ...;
Upvotes: 1
Reputation: 16571
Not sure if this is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into dates using .dt.date.
import pandas as pd
import dask.dataframe as dd

sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02',
]
df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
ddf = dd.from_pandas(df, npartitions=2)

# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()

# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date

ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
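To double-check what was written, the folder can be read back and the column types inspected (a sketch; on my understanding of pyarrow, a column of datetime.date objects should round-trip as dates rather than as raw timestamps):
# read the data back and confirm 'datestring' is no longer a raw integer timestamp
ddf_check = dd.read_parquet('test.parquet', engine='pyarrow')
print(ddf_check.dtypes)
print(ddf_check.head())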
For time conversion, you can use .astype or dd.to_datetime; see the answers to this question. There is also a very similar question and answer, which suggests that ensuring the timestamp is downcast to ms resolves the issue.
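For instance, either of these (a sketch using the string column from the snippet above) parses the strings into datetime64:
# equivalent alternatives for parsing the date strings
ddf['datestring'].astype('M8[s]')
dd.to_datetime(ddf['datestring'], format='%Y-%m-%d %H:%M:%S')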
So, playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scale of the values (they look like microseconds, not nanoseconds or seconds):
# both yield: Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
Upvotes: 3