Reputation: 439
I have JSONs in the below format in a S3 bucket and I'm trying to extract only the "id", "label" & "value" from the "fields" key using Athena. I tried ARRAY-MAP but wasn't successful. Also, on the "value" field - I want the content to be captured as a simple text ignoring any list / dictionaries in it.
I also don't want to create any Hive schema for these JSONs and looking for a Presto SQL solution if possible.
{
"reports":{
"client":{
"pdf":"https://reports.s3-accelerate.amazonaws.com/looks/123/reports/client.pdf",
"html":"https://api.com/looks/123/reports/client.html"
},
"public":{
"pdf":"https://s3.amazonaws.com/reports.com/looks/123/reports/public.pdf",
"html":"https://api.look.com/looks/123/reports/public.html"
}
},
"actors":{
"looker":{
"firstName":"Rosa",
"lastName":"Mart"
},
"client":{
"email":"[email protected]",
"firstName":"XXX",
"lastName":"XXX"
}
},
"_id":"123",
"fields":[
{
"id":"fence_condition_missing_sections",
"context":[
"Fence Condition"
],
"label":"Missing Sections",
"type":"choice",
"value":"None"
},
{
"id":"photos_landscaped_area",
"context":[
"Landscaping Photos"
],
"label":"Landscaped Area",
"type":"photo-with-description",
"value":[
{
"description":"Front",
"photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/1.jpg"
},
{
"description":"Front entrance ",
"photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/2.jpg"
}
]
}
],
"jobNumber":"xxx",
"createdAt":"2018-10-11T22:39:37.223Z",
"completedAt":"2018-01-27T20:13:49.937Z",
"inspectedAt":"2018-01-21T23:33:48.718Z",
"type":"ZZZ-commercial",
"name":"Commercial"
}'
expected output:
--------------------------------------------------------------------------------
| ID | LABEL | VALUE |
--------------------------------------------------------------------------------
| photos_landscaped_area | Landscaped Area | [{"description":"Front",...}] |
----------------------------------------------------------------------------
| fence_condition_missing_sections | Missing Sections | None|
----------------------------------------------------------------------------
Upvotes: 2
Views: 7424
Reputation: 132862
I'm going to assume your data is in a one-document-per-line format and that you provided a formatted example for readability's sake. If this is incorrect, please see the question Multi-line JSON file querying in hive .
When the schema of a JSON document is not entirely regular you can create that column as a string
column and use the JSON_*
functions to extract values out of it.
First you need to create a table for the raw data:
CREATE TABLE data (
fields array<struct<id:string,label:string,value:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://…'
(if you're not interested in the other fields in the JSON documents you can just ignore those when creating the table)
Then you create a view that flattens the data:
CREATE VIEW flat_data AS
SELECT
field.id,
field.label,
field.value
FROM data
CROSS JOIN UNNEST(fields) AS f(field)
Selecting from this view should give you the results you are looking for.
I suspect you are also looking for how to extract properties from the values
structure, which is what I alluded to above:
SELECT
label,
JSON_EXTRACT(value, '$.photo') AS photo_urls
FROM flat_data
WHERE id = 'photos_landscaped_area'
Look in the Presto documentation for all available JSON functions.
Upvotes: 3