Reputation: 891
I am currently creating an Athena table as follows:
CREATE EXTERNAL TABLE `foo_streaming`(
`type` string,
`message` struct<a:string,b:string,c:string>)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://foo/data'
However, instead of treating the message struct as structured data, I would like to read it as a JSON blob, because the data could change at any point. How do I do this with Athena?
I tried the following, but it gives me an error. I tried googling, but found nothing.
CREATE EXTERNAL TABLE `foo_streaming`(
`type` string,
`message` JSON)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://foo/data'
Sample data from S3 would like like:
{ "type": "GTF", "message": { "a": 1, "b": 2 } }
{ "type": "GTB", "message": { "c": 1, "d": 2, "x": { "testid": "abc" } } }
{ "type": "GTE", "message": { "error_code": 1 } }
Upvotes: 1
Views: 2048
Reputation: 132862
Use string
as type, and then Athena/Presto's JSON functions to extract values from the blobs.
You can see this solution in action in the documentation for how to query CloudTrail logs with Athena. The requestparameters
, and responseelements
properties are JSON, but are service-dependent, and therefore can't be described with a struct.
json
doesn't work as a type because it's not a type recognised by Hive, it's a Presto thing, IIRC. I find it pretty confusing in general when types are supported or not, or called different things depending on whether it's DDL or DML (e.g. string
and varchar
).
Upvotes: 2