Reputation: 3677
I am mostly reproducing here an issue I have seen raised on forum.aws, in the hope that the answers and explanations from the Stack Overflow community will be more thorough and illuminating than the discussion on that forum.
Here is my experience of the issue:
I make a parquet file from a dataframe in Python using pandas, and cast a field/column, say birthday, as `datetime64[ns]` using `pandas.to_datetime`. This part of the process seems flawless: I can read the parquet file back with `pandas.read_parquet` and get what I expect, namely the dates I entered as datetimes.
However, when I upload said parquet file to S3 and put an Athena table on it, reading the same birthday column yields junk dates that in no way match the ones in the parquet file. For example:
import pandas

t = pandas.DataFrame([['Haiti', pandas.to_datetime('1804-01-01')]], columns=['Country', 'Independence'])
t.to_parquet("s3://<mybucket>/tmp/t.parquet")
|Country | Independence|
|--------|-------------|
|Haiti | 1804-01-01 |
CREATE EXTERNAL TABLE IF NOT EXISTS default.mytable (
`Country` string,
`Independence` timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://<mybucket>/tmp/'
TBLPROPERTIES ('has_encrypted_data'='false');
SELECT * FROM "default"."mytable" limit 10;
|Country | Independence |
|--------|--------------------------|
|Haiti |-164033-12-18 00:00:00.000|
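For what it's worth, inspecting the schema of the file pandas wrote (a minimal diagnostic sketch, assuming pyarrow is installed and the file is available locally; the path is a placeholder) shows the column is stored at a finer precision than Athena seems to read it with, which is consistent with the garbled year being off by a large constant factor:

import pyarrow.parquet as pq

# Read only the parquet schema of the file written above (copied locally for brevity).
schema = pq.read_schema("t.parquet")
print(schema.field("Independence").type)  # e.g. timestamp[us] or timestamp[ns], depending on pyarrow version/defaults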
Upvotes: 2
Views: 3107
Reputation: 3677
You can force `to_parquet` to write timestamps in a format Athena will understand by passing `coerce_timestamps='ms'` (forwarded to the underlying pyarrow engine):
import pandas

t = pandas.DataFrame([['Haiti', pandas.to_datetime('1804-01-01')]], columns=['Country', 'Independence'])
t.to_parquet("s3://<mybucket>/tmp/t.parquet", coerce_timestamps='ms')
|Country | Independence|
|--------|-------------|
|Haiti | 1804-01-01 |
CREATE EXTERNAL TABLE IF NOT EXISTS default.mytable (
`Country` string,
`Independence` timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://<mybucket>/tmp/'
TBLPROPERTIES ('has_encrypted_data'='false');
SELECT * FROM "default"."mytable" limit 10;
|Country | Independence |
|--------|-----------------------|
|Haiti |1804-01-01 00:00:00.000|
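Equivalently, you can do the write with pyarrow directly and get the same result (a sketch, assuming pyarrow is installed; the local path is a placeholder, and `allow_truncated_timestamps=True` is only needed if your timestamps carry sub-millisecond precision, which would otherwise make the coercion raise):

import pandas
import pyarrow as pa
import pyarrow.parquet as pq

t = pandas.DataFrame([['Haiti', pandas.to_datetime('1804-01-01')]], columns=['Country', 'Independence'])

# Convert to an Arrow table and write millisecond-precision timestamps, which Athena reads correctly.
table = pa.Table.from_pandas(t, preserve_index=False)
pq.write_table(table, "t.parquet", coerce_timestamps='ms', allow_truncated_timestamps=True)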
Upvotes: 3