Reputation: 1436
I am trying to flatten input JSON data that has two map/dictionary fields (custom_event1 and custom_event2), which may contain arbitrary key-value pairs. To create an output table from the data frame, I need to avoid flattening the custom_events and instead store each of them as a JSON string in its own column.
Following this doc, Relationalize.apply flattens the custom_event maps as well.
Sample JSON:
{
  "id": "sklfsdfskdlfsdfsdfkhsdfssdf",
  "idtype": "cookieId",
  "event": "install",
  "sub_event": null,
  "ip": "XXXXXX",
  "geo": {
    "country": "IN",
    "city": null,
    "region": null
  },
  "carrier": {
    "operator": null,
    "network": null,
    "connection_type": null
  },
  "user_agent": "Mozilla/5.0",
  "device": {
    "brand": "LYF",
    "model": null,
    "type": null
  },
  "package": {
    "pkgName": "XXXXXXXX",
    "pkgVersion": "1.5.6.3",
    "pkgRating": null,
    "timestamp": "2017-12-14 11:51:27"
  },
  "custom_event1": {
    "key1": "value1",
    "key2": "value2"
  },
  "custom_event2": {
    "key": "value"
  }
}
How can I store JSON data with a dynamic map field in relational storage?
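Something like the following is what I am after (a rough sketch, assuming the data is first loaded into a Spark DataFrame; to_json is Spark's built-in serializer for map/struct columns):

from pyspark.sql.functions import col, to_json

# Serialize the dynamic map columns to plain JSON strings so that a
# later flattening step leaves them untouched
df = df.withColumn("custom_event1", to_json(col("custom_event1"))) \
       .withColumn("custom_event2", to_json(col("custom_event2")))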
Upvotes: 4
Views: 9996
Reputation: 1110
The steps you would need, assuming the JSON data is in S3:
1. Create a crawler in AWS Glue and let it create a schema in the Data Catalog (database). This assumes you are at least a little familiar with AWS Glue.
2. Create a Glue job that flattens the data with the Relationalize transform (https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/) and writes it out in your favorite format, e.g. Parquet (see the sketch after this list).
3. Create a crawler for the new flattened data and create the table in AWS Glue.
4. Use Athena, Amazon QuickSight, or your favorite BI tool to query the Parquet data.
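A rough sketch of step 2 as a Glue PySpark job (the database, table name, and S3 paths below are placeholders, not from the original post):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table that the first crawler created
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="events_db", table_name="raw_events")

# Relationalize flattens nested fields and returns a collection of
# frames; "root" is the main flattened frame
frames = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root")
root = frames.select("root")

# Write the flattened frame as Parquet for Athena/QuickSight
glue_context.write_dynamic_frame.from_options(
    frame=root,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/flattened/"},
    format="parquet")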
Upvotes: 3
Reputation: 1
I had a similar problem and created a classifier for JSON files following this documentation.
When I ran the crawler with the classifier attached, the whole JSON structure was, as if by magic, mapped into a relational model that could be stored in a Redshift cluster database.
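In case it helps, a minimal sketch of registering such a classifier with boto3 (the classifier name and JsonPath are assumptions; adjust the JsonPath to the layout of your files):

import boto3

glue = boto3.client("glue")

# Register a custom JSON classifier; attach it to the crawler when
# creating or updating it. Name and JsonPath are placeholders.
glue.create_classifier(
    JsonClassifier={
        "Name": "my-json-classifier",
        "JsonPath": "$[*]"  # treat each top-level array element as a record
    }
)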
Upvotes: 0