jon
jon

Reputation: 439

How to extract data from specific fields in a NESTED JSON using AWS Athena - Presto?

I have JSONs in the below format in a S3 bucket and I'm trying to extract only the "id", "label" & "value" from the "fields" key using Athena. I tried ARRAY-MAP but wasn't successful. Also, on the "value" field - I want the content to be captured as a simple text ignoring any list / dictionaries in it.

I also don't want to create any Hive schema for these JSONs and looking for a Presto SQL solution if possible.

{  
    "reports":{  
        "client":{  
            "pdf":"https://reports.s3-accelerate.amazonaws.com/looks/123/reports/client.pdf",
            "html":"https://api.com/looks/123/reports/client.html"
        },
        "public":{  
            "pdf":"https://s3.amazonaws.com/reports.com/looks/123/reports/public.pdf",
            "html":"https://api.look.com/looks/123/reports/public.html"
        }
    },
    "actors":{  
        "looker":{  
            "firstName":"Rosa",
            "lastName":"Mart"
        },
        "client":{  
            "email":"[email protected]",
            "firstName":"XXX",
            "lastName":"XXX"
        }
    },
    "_id":"123",
    "fields":[  
                {  
        "id":"fence_condition_missing_sections",
        "context":[  
            "Fence Condition"
        ],
        "label":"Missing Sections",
        "type":"choice",
        "value":"None"
    },
        {  
            "id":"photos_landscaped_area",
            "context":[  
                "Landscaping Photos"
            ],
            "label":"Landscaped Area",
            "type":"photo-with-description",
            "value":[  
                {  
                    "description":"Front",
                    "photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/1.jpg"
                },
                {  
                    "description":"Front entrance ",
                    "photo":"https://reports-wegolook-com.s3-accelerate.amazonaws.com/looks/123/looker/2.jpg"
                }
            ]
        }
    ],
    "jobNumber":"xxx",
    "createdAt":"2018-10-11T22:39:37.223Z",
    "completedAt":"2018-01-27T20:13:49.937Z",
    "inspectedAt":"2018-01-21T23:33:48.718Z",
    "type":"ZZZ-commercial",
    "name":"Commercial"
}'

expected output:

--------------------------------------------------------------------------------
| ID     | LABEL |  VALUE                                                  | 
--------------------------------------------------------------------------------
|   photos_landscaped_area     |  Landscaped Area |  [{"description":"Front",...}]    |
----------------------------------------------------------------------------
| fence_condition_missing_sections | Missing Sections | None|
----------------------------------------------------------------------------

Upvotes: 2

Views: 7424

Answers (1)

Theo
Theo

Reputation: 132862

I'm going to assume your data is in a one-document-per-line format and that you provided a formatted example for readability's sake. If this is incorrect, please see the question Multi-line JSON file querying in hive .

When the schema of a JSON document is not entirely regular you can create that column as a string column and use the JSON_* functions to extract values out of it.

First you need to create a table for the raw data:

CREATE TABLE data (
  fields array<struct<id:string,label:string,value:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://…'

(if you're not interested in the other fields in the JSON documents you can just ignore those when creating the table)

Then you create a view that flattens the data:

CREATE VIEW flat_data AS
SELECT
  field.id,
  field.label,
  field.value
FROM data
CROSS JOIN UNNEST(fields) AS f(field)

Selecting from this view should give you the results you are looking for.

I suspect you are also looking for how to extract properties from the values structure, which is what I alluded to above:

SELECT
  label,
  JSON_EXTRACT(value, '$.photo') AS photo_urls
FROM flat_data
WHERE id = 'photos_landscaped_area'

Look in the Presto documentation for all available JSON functions.

Upvotes: 3

Related Questions