Missaka Wijekoon
Missaka Wijekoon

Reputation: 891

Create athena table with column as unstructured JSON from S3

I am currently creating an Athena table as follows:

 CREATE EXTERNAL TABLE `foo_streaming`(
  `type` string, 
  `message` struct<a:string,b:string,c:string>)
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hive.hcatalog.data.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foo/data'

However, instead of treating the message struct as structured data, I would like to read it as a JSON blob, because the data could change at any point. How do I do this with Athena?

I tried the following, but it gives me an error. I tried googling, but found nothing.

CREATE EXTERNAL TABLE `foo_streaming`(
      `type` string, 
      `message` JSON)
    PARTITIONED BY ( 
      `dt` string)
    ROW FORMAT SERDE 
      'org.apache.hive.hcatalog.data.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://foo/data'

Sample data from S3 would like like:

{ "type": "GTF", "message": { "a": 1, "b": 2 } }
{ "type": "GTB", "message": { "c": 1, "d": 2, "x": { "testid": "abc" } } }
{ "type": "GTE", "message": { "error_code": 1 } }

Upvotes: 1

Views: 2048

Answers (1)

Theo
Theo

Reputation: 132862

Use string as type, and then Athena/Presto's JSON functions to extract values from the blobs.

You can see this solution in action in the documentation for how to query CloudTrail logs with Athena. The requestparameters, and responseelements properties are JSON, but are service-dependent, and therefore can't be described with a struct.

json doesn't work as a type because it's not a type recognised by Hive, it's a Presto thing, IIRC. I find it pretty confusing in general when types are supported or not, or called different things depending on whether it's DDL or DML (e.g. string and varchar).

Upvotes: 2

Related Questions