How to model nested json data on redshift to query specific neseted property

Question

I have the following JSON file structure on S3:

{
  "userId": "1234",
  "levelA": {
    "LevelB": [
      {
        "bssid": "University",
        "timestamp": "153301355685"
      },
      {
        "bssid": "Mall",
        "timestamp": "153301355688"
      }
    ]
  }
}

Now one of our future queries would be:

Return the total of users who saw bssid=University

So in my case it will return 1 (because userId=1234 contains that bssid's value)

Is Redshift the right solution for me for this type of query? In the case that it is, how can I model it?

John Rotenstein · Accepted Answer

The easiest way to model this would be to create a table with one row for each combination of userId and bssd:

userId, bssid, timestamp
1234,University,153301355685
1234,Mall,153301355688

The difficult part would be converting your JSON (contained in multiple files) into a suitable format for Redshift.

While Amazon Redshift can import data in JSON format, it would not handle the one-to-many relationship within your nested data.

Amazon Redshift also has a JSON_EXTRACT_PATH_TEXT Function that can extract data from a JSON string, but again it wouldn't handle the one-to-many relationship in your data.

I would recommend transforming your data into the above format prior to loading into Redshift. This would need to be done with an external script or ETL tool.

If you are frequently generating such files, a suitable method would be to trigger an AWS Lambda function whenever one of these files is stored in the Amazon S3 bucket. The Lambda function would then parse the file and output the CSV format, ready for loading into Redshift.

How to model nested json data on redshift to query specific neseted property

Answers (1)

Related Questions