Reputation: 1071
I have a Dask data frame that has two columns, a date and a value.
I store it like so:
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)
I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (presumably timestamps) instead of dates. For example I get:
1546300800000000 instead of 2019-01-01
1548979200000000 instead of 2019-02-01
Is there a way to tell Dask to store the column as a date? How can I run a SELECT with Apache Drill and get dates back? I tried using CAST in Drill, but it doesn't convert the numbers to dates.
Upvotes: 1
Views: 1430
Reputation: 576
Here's a link to the Drill docs about the TO_TIMESTAMP()
function. (https://drill.apache.org/docs/data-type-conversion/#to_timestamp) I think @mdurant is correct in his approach.
I would try either:
SELECT TO_TIMESTAMP(<date_col>) FROM ...
or
SELECT TO_TIMESTAMP((<date_col> / 1000)) FROM ...
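If you want to check which divisor to use before writing the query, a quick sanity check in Python (a sketch, assuming, as the 16 digits suggest, that the stored values are microseconds since the Unix epoch) would be:
import pandas as pd
# interpret the 16-digit value from the question as microseconds since 1970-01-01
pd.to_datetime(1546300800000000, unit='us')  # Timestamp('2019-01-01 00:00:00')
# so the column needs to be divided down to seconds (by 1000000, not 1000)
# before passing it to TO_TIMESTAMP(), per the linked docs
1546300800000000 / 1000000  # 1546300800.0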
Upvotes: 1
Reputation: 28683
If memory serves, Drill uses the old non-standard INT96 timestamp encoding, which was never properly part of the parquet specification. A parquet timestamp is essentially a UNIX timestamp, stored as an int64 with one of several precisions. Drill must have a function to convert this correctly to its internal format.
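If you want to confirm what type actually ended up in the file, a quick check with pyarrow would look something like this (a sketch; the part-file name is just a guess at what dask produced in /some/folder):
import pyarrow.parquet as pq
# print the parquet schema of one of the part files dask wrote
print(pq.read_schema('/some/folder/part.0.parquet'))
# the date column should show up as an int64-backed timestamp type, not as a date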
I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but might give you the idea:
SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000) FROM ...;
Upvotes: 1
Reputation: 16571
Not sure if this is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into dates using .dt.date.
import pandas as pd
import dask.dataframe as dd

sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02',
]
df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])
ddf = dd.from_pandas(df, npartitions=2)

# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()

# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date

ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')
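To double-check what was written, the folder can be read back and the column types inspected (a sketch; on my understanding of pyarrow, a column of datetime.date objects should round-trip as dates rather than as raw timestamps):
# read the data back and confirm 'datestring' is no longer a raw integer timestamp
ddf_check = dd.read_parquet('test.parquet', engine='pyarrow')
print(ddf_check.dtypes)
print(ddf_check.head())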
For time conversion, you can use .astype or dd.to_datetime; see the answers to this question. There is also a very similar question and answer, which suggests that ensuring the timestamp is downcast to ms resolves the issue.
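For instance, either of these (a sketch using the string column from the snippet above) parses the strings into datetime64:
# equivalent alternatives for parsing the date strings
ddf['datestring'].astype('M8[s]')
dd.to_datetime(ddf['datestring'], format='%Y-%m-%d %H:%M:%S')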
So, playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scale of the values (they look like microseconds, not nanoseconds or seconds):
# both yield: Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
Upvotes: 3