Artem
Artem

Reputation: 833

Querying timestamp data in Athena when the timestamp format in the underlying JSON files has changed

I'm querying data in AWS Athena from JSON files stored in S3. I've loaded all the JSON files into Athena using AWS Glue, and it's been working perfectly so far. However, the timestamp formatting has changed in the JSON files from

2018-03-23 15:00:30.998

to

2018-08-29T07:59:50.568Z

So the table ends up having entries like this

2018-08-29T07:59:42.803Z
2018-08-29T07:59:42.802Z
2018-08-29T07:59:32.500Z
2018-03-23 15:03:43.232
2018-03-23 15:03:44.697
2018-03-23 15:04:11.951

This results in parsing errors when I try to run queries against the full DB.

How do I accommodate this in AWS Glue (or Athena), so I don't have to split up the data when querying? I've tried looking into custom classifiers, but I'm unsure of how to use them in this particular case.

Thanks in advance.

Upvotes: 1

Views: 2864

Answers (2)

jbgorski
jbgorski

Reputation: 1939

Unfortunately you have to unify the data.

If you decide to use "2018-08-29T07:59:50.568Z" format you can read such data by using org.apache.hive.hcatalog.data.JsonSerDe library with the following serde property:

'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'

Upvotes: 2

Lucas Claude
Lucas Claude

Reputation: 1

I don't know if it was possible back then but now you can specify multiple timestamp formats simultaneously like so:

"timestamp.formats" = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss'Z'"

Upvotes: 0

Related Questions