rhholt
rhholt

Reputation: 419

Query JSON Key:Value Pairs in AWS Athena

I have received a data set from a client that is loaded in AWS S3. The data contains unnamed JSON key:value pairs. This isn't my area of expertise, so I was looking for a little help.

The structure of JSON data that I've typically worked with in the past looks similar to this:

{ "name":"John", "age":30, "car":null }

The data that I have received from my client is formatted as such:

{
  "answer_id": "cc006",
  "answer": {
    "101086": 1,
    "101087": 2,
    "101089": 2,
    "101090": 7,
    "101091": 5,
    "101092": 3,
    "101125": 2
  }
}

This is survey data, where the key on the left is a numeric customer identifier, and the value on the right is their response to a survey question, i.e. customer "101125" answered the survey with a value of "2". I need to be able to query the JSON data using Athena such that my result set looks similar to:

enter image description here

Cross joining the unnested children against the parent node isn't an issue. What I can't figure out is how to select all of the keys from the array "answer" without specifying that actual key name. Similarly, I want to be able to select all of the values as well.

Is it possible to create a virtual table in Athena that would allow for these results, or do I need to convert the JSON to a format this looks more similar to the following:

{
  "answer_id": "cc006",
  "answer": [
    { "key": "101086", "value": 1 },
    { "key": "101087", "value": 2 },
    { "key": "101089", "value": 2 },
    { "key": "101090", "value": 7 },
    { "key": "101091", "value": 5 },
    { "key": "101092", "value": 3 },
    { "key": "101125", "value": 2 }
  ]
}

EDIT 6/4/2020

I was able to use the code that Theon provided below along with the following table structure:

CREATE EXTERNAL TABLE answer_example (
  answer_id string,
  answer string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'

That allowed me to use the following query to generate the results that I needed.

WITH Data AS(
SELECT 
  answer_id, 
  CAST(json_extract(answer, '$') AS MAP(VARCHAR, VARCHAR)) as answer
FROM
  answer_example
)
SELECT 
  answer_id,
  key,
  element_at(answer, key) AS value
FROM 
  Data
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)

EDIT 6/5/2020

Taking additional advice from Theon's response below, the following DDL and Query simplify this quite a bit.

DDL:

CREATE EXTERNAL TABLE answer_example (
  answer_id string,
  answer map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'

Query:

SELECT 
  answer_id,
  key,
  element_at(answer, key) AS value
FROM 
  answer_example
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)

Upvotes: 2

Views: 7550

Answers (1)

Theo
Theo

Reputation: 132952

Cross joining with the keys of the answer property and then picking the corresponding value. Something like this:

WITH data AS (
  SELECT
    'cc006' AS answer_id,
    MAP(
      ARRAY['101086', '101087', '101089', '101090', '101091', '101092', '101125'],
      ARRAY[1, 2, 2, 7, 5, 3, 2]
    ) AS answers
)
SELECT
  answer_id,
  key,
  element_at(answers, key) AS value
FROM data
CROSS JOIN UNNEST (map_keys(answers)) AS answer (key) 

You could probably do something with transform_keys to create rows of the key value pairs, but the SQL above does the trick.

Upvotes: 3

Related Questions