Reputation: 419
I have received a data set from a client that is loaded in AWS S3. The data contains unnamed JSON key:value pairs. This isn't my area of expertise, so I was looking for a little help.
The structure of JSON data that I've typically worked with in the past looks similar to this:
{ "name":"John", "age":30, "car":null }
The data that I have received from my client is formatted as such:
{
"answer_id": "cc006",
"answer": {
"101086": 1,
"101087": 2,
"101089": 2,
"101090": 7,
"101091": 5,
"101092": 3,
"101125": 2
}
}
This is survey data, where the key on the left is a numeric customer identifier, and the value on the right is their response to a survey question, i.e. customer "101125" answered the survey with a value of "2". I need to be able to query the JSON data using Athena such that my result set looks similar to:
Cross joining the unnested children against the parent node isn't an issue. What I can't figure out is how to select all of the keys from the array "answer" without specifying that actual key name. Similarly, I want to be able to select all of the values as well.
Is it possible to create a virtual table in Athena that would allow for these results, or do I need to convert the JSON to a format this looks more similar to the following:
{
"answer_id": "cc006",
"answer": [
{ "key": "101086", "value": 1 },
{ "key": "101087", "value": 2 },
{ "key": "101089", "value": 2 },
{ "key": "101090", "value": 7 },
{ "key": "101091", "value": 5 },
{ "key": "101092", "value": 3 },
{ "key": "101125", "value": 2 }
]
}
EDIT 6/4/2020
I was able to use the code that Theon provided below along with the following table structure:
CREATE EXTERNAL TABLE answer_example (
answer_id string,
answer string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'
That allowed me to use the following query to generate the results that I needed.
WITH Data AS(
SELECT
answer_id,
CAST(json_extract(answer, '$') AS MAP(VARCHAR, VARCHAR)) as answer
FROM
answer_example
)
SELECT
answer_id,
key,
element_at(answer, key) AS value
FROM
Data
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)
EDIT 6/5/2020
Taking additional advice from Theon's response below, the following DDL and Query simplify this quite a bit.
DDL:
CREATE EXTERNAL TABLE answer_example (
answer_id string,
answer map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'
Query:
SELECT
answer_id,
key,
element_at(answer, key) AS value
FROM
answer_example
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)
Upvotes: 2
Views: 7550
Reputation: 132952
Cross joining with the keys of the answer
property and then picking the corresponding value. Something like this:
WITH data AS (
SELECT
'cc006' AS answer_id,
MAP(
ARRAY['101086', '101087', '101089', '101090', '101091', '101092', '101125'],
ARRAY[1, 2, 2, 7, 5, 3, 2]
) AS answers
)
SELECT
answer_id,
key,
element_at(answers, key) AS value
FROM data
CROSS JOIN UNNEST (map_keys(answers)) AS answer (key)
You could probably do something with transform_keys
to create rows of the key value pairs, but the SQL above does the trick.
Upvotes: 3