Sumit Saurabh
Sumit Saurabh

Reputation: 1436

How to relationalize a JSON to flat structure in AWS Glue

Trying to flatten input JSON data having two map/dictionary fields (custom_event1 and custom_event2), which may contain any key-value pair data. In order to create an output table from the data frame, will have to avoid the flattening of custom_events and store it as JSON string in the column.

Following this doc, Relationalize.apply is flattening the custom_events map also.

Sample JSON:

{
    "id": "sklfsdfskdlfsdfsdfkhsdfssdf",
    "idtype": "cookieId",
    "event": "install",
    "sub_event": null,
    "ip": "XXXXXX",
    "geo": {
        "country": "IN",
        "city": null,
        "region": null
    },
    "carrier": {
        "operator": null,
        "network": null,
        "connection_type": null
    },
    "user_agent": "Mozilla/5.0",
    "device": {
        "brand": "LYF",
        "model": null,
        "type": null
    },
    "package": {
        "pkgName": "XXXXXXXX",
        "pkgVersion": "1.5.6.3",
        "pkgRating": null,
        "timestamp": "2017-12-14 11:51:27"
    },
    "custom_event1": {
        "key1": "value1",
        "key2": "value2"
    },
    "custom_event2": {
        "key": "value"
    }
}

How to store JSON data with a dynamic map field in a Relational storage?

Upvotes: 4

Views: 9996

Answers (2)

Haroon
Haroon

Reputation: 1110

The steps that you would need, assumption that JSON data is in S3

  1. Create a Crawler in AWS Glue and let it create a schema in a catalog (database). Assumption is that you are familiar with AWS Glue a little.

  2. Create a Glue job that transforms the JSON into your favorite format (parquet) that uses the transform step to flatten the data using Rationalize class - https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/ and writes to parquet format

  3. Create a crawler for the new flatten data and create the table in aws glue

  4. Use Athena or AWS Quick sight or your favorite BI tool to query parquet data

Upvotes: 3

Leila Bergamasco
Leila Bergamasco

Reputation: 1

I had a similar problem and create a classifier for JSON files following this documentation.

When I executed the crawler, I added the classifier and like as magic all the JSON structure was in a relation model to be stored in a Redshift cluster database.

Upvotes: 0

Related Questions