Christian
Christian

Reputation: 605

Apache-Drill doesn't understand Pandas datetime64[ns]

I'm using Pyarrow, Pyarrow.Parquet as well as Pandas. When I send a Pandas datetime64[ns] series to a Parquet file and load it again via a drill query, the query shows an Integer like: 1467331200000000 which seems to be something else than a UNIX timestamp.

The query looks like this:

SELECT workspace.id-column AS id-column, workspace.date-column AS date-column

When I open that file within Python again, it loads correctly and still has its datetime64[ns] type.

Any idea what's going wrong and how to solve this? I want this value being shown as a regular date.

Upvotes: 3

Views: 598

Answers (2)

Christian
Christian

Reputation: 605

Ok, I found a solution some days ago which I would like to share. I think I initially missed something. It's very important to downcast to [ms] as well as allowing truncating timestamps before sending the dataframe to Parquet for becoming able to open it issue free in Drill:

pq.write_table(table, rf'{name}.parquet',
           coerce_timestamps='ms',
           allow_truncated_timestamps=True)

When I define a view in Drill I can cast that column as date or timestamp as required.

Upvotes: 3

Vova Vysotskyi
Vova Vysotskyi

Reputation: 661

Could you please share parquet schema? Is it contains TIMESTAMP logical type for this column?

For now, you can write your custom UDF which converts BigInt nanoseconds value to timestamp, or use built-in functions (not sure about ns, but there are a lot of functions which accept millis):

select to_timestamp(1467331200000/1000);
+-----------------------+
|        EXPR$0         |
+-----------------------+
| 2016-07-01 03:00:00.0 |
+-----------------------+

Upvotes: 0

Related Questions