lsr

Reputation: 19

How to flatten a nested json array?

I need to flatten a JSON document with several levels of nested arrays in Python.

Part of my JSON looks like:

{
  "data": {
    "workbooks": [
      {
        "projectName": "TestProject",
        "name": "wkb1",
        "site": {
          "name": "site1"
        },
        "description": "",
        "createdAt": "2020-12-13T15:38:58Z",
        "updatedAt": "2020-12-13T15:38:59Z",
        "owner": {
          "name": "user1",
          "username": "John"
        },
        "embeddedDatasources": [
          {
            "name": "DS1",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS1",
                "hasExtracts": false,
                "owner": {
                  "username": "user2"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_1",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_2",
                "schema": "schema_2",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_3",
                "schema": "schema_3",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          },
          {
            "name": "DS2",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS2",
                "hasExtracts": false,
                "owner": {
                  "username": "user3"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_4",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          }
        ]
      }
    ]
  }
}

The output should look like this:

(sample output shown as an image in the original post)

I tried using json_normalize but couldn't make it work. Currently I parse it by looping over the nested arrays and reading values by key. I'm looking for a better way of normalizing the JSON.

Upvotes: -3

Views: 2011

Answers (2)

Jagvir

Reputation: 17

tl;dr: The final output, along with the detailed steps to get there, is worked through below.

Details:

To answer this question you need a thorough understanding of pandas.json_normalize — in particular its record_path and meta parameters — plus DataFrame.explode and JSON parsing in general.

import json
import pandas as pd

data = {
    "data":
    {
        "workbooks":
        [
            {
                "projectName": "TestProject",
                "name": "wkb1",
                "site":
                {
                    "name": "site1"
                },
                "description": "",
                "createdAt": "2020-12-13T15:38:58Z",
                "updatedAt": "2020-12-13T15:38:59Z",
                "owner":
                {
                    "name": "user1",
                    "username": "John"
                },
                "embeddedDatasources":
                [
                    {
                        "name": "DS1",
                        "hasExtracts": False,
                        "upstreamDatasources":
                        [
                            {
                                "projectName": "Data Sources",
                                "name": "DS1",
                                "hasExtracts": False,
                                "owner":
                                {
                                    "username": "user2"
                                }
                            }
                        ],
                        "upstreamTables":
                        [
                            {
                                "name": "table_1",
                                "schema": "schema_1",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            },
                            {
                                "name": "table_2",
                                "schema": "schema_2",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            },
                            {
                                "name": "table_3",
                                "schema": "schema_3",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            }
                        ]
                    },
                    {
                        "name": "DS2",
                        "hasExtracts": False,
                        "upstreamDatasources":
                        [
                            {
                                "projectName": "Data Sources",
                                "name": "DS2",
                                "hasExtracts": False,
                                "owner":
                                {
                                    "username": "user3"
                                }
                            }
                        ],
                        "upstreamTables":
                        [
                            {
                                "name": "table_4",
                                "schema": "schema_1",
                                "database":
                                {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

First, drill down to the list of workbook dicts:

data_list = data['data']['workbooks']

Next, massage the data by renaming some columns to match the desired output:

data_list_pd = pd.DataFrame(data_list)
data_list_pd = data_list_pd.rename(
    columns={'name': 'wkb',
             'createdAt': 'wkb_createdDt',
             'updatedAt': 'wkb_updatedDt',
             'projectName': 'prj'},
    errors='ignore')
data_list_pd

data_list = json.loads(data_list_pd.to_json(orient="records"))
data_list

Next is where the core of your problem lies. Flatten the JSON by passing record_path, which is essentially the nested list you want to expand, along with meta, the remaining metadata columns you want to keep. After that, explode the columns that still hold lists; you can do this by chaining the explode method a couple of times.

flattened_dataframe = pd.json_normalize(
    data_list,
    record_path='embeddedDatasources',
    meta=['prj', 'wkb', 'wkb_createdDt', 'wkb_updatedDt',
          ['site', 'name'], ['owner', 'name'], ['owner', 'username']],
    errors='ignore').explode('upstreamDatasources').explode('upstreamTables')
flattened_dataframe

You can repeat this process a couple of times to reach your desired result. Since json_normalize works on dicts/JSON records rather than DataFrames, you have to convert the DataFrame back to JSON records after each iteration:

flattened_json = json.loads(flattened_dataframe.to_json(orient="records"))

Also read about to_json.
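As a sketch of what the second pass might look like (the record below is a hypothetical stand-in for one row of the first pass after explode, using the renamed column names): once the lists have been exploded, each row holds a single dict in the exploded columns, and pd.json_normalize expands such nested dicts into dotted columns automatically.

```python
import pandas as pd

# Hypothetical shape of one record after the first normalize + explode:
# the exploded columns now hold single dicts rather than lists.
flattened_json = [
    {"name": "DS1",
     "upstreamTables": {"name": "table_1", "schema": "schema_1",
                        "database": {"name": "testdb"}}},
]

# A plain json_normalize call (no record_path) flattens nested dicts
# into dotted column names such as 'upstreamTables.database.name'.
second_pass = pd.json_normalize(flattened_json)
print(second_pass.columns.tolist())
```

Repeating normalize/explode passes like this until no column holds a list or dict gives the fully flat table.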

Upvotes: -1

thenarfer

Reputation: 455

Here's a partial solution:

First, save your data as a JSON file called data.json in the same directory as the script.

import json
import pandas as pd

with open('data.json') as json_file:
    json_data = json.load(json_file)

new_data = json_data['data']['workbooks']

# pandas.io.json.json_normalize is deprecated; use pd.json_normalize instead
result = pd.json_normalize(
    new_data,
    record_path=['embeddedDatasources', 'upstreamTables'],
    meta=['projectName', 'name', 'createdAt', 'updatedAt', 'owner', 'site'],
    record_prefix='_')

result

Output:

   _name    _schema   _database.name  _database.connectionType  projectName  name  createdAt             updatedAt             owner                                  site
0  table_1  schema_1  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
1  table_2  schema_2  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
2  table_3  schema_3  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}
3  table_4  schema_1  testdb          redshift                  TestProject  wkb1  2020-12-13T15:38:58Z  2020-12-13T15:38:59Z  {'name': 'user1', 'username': 'John'}  {'name': 'site1'}

What next?

I think if you restructure the data a bit in advance (for example, flattening 'database': {'name': 'testdb', 'connectionType': 'redshift'}), you will be able to add more fields to the meta parameter.

As you see in the documentation of json_normalize, the four parameters that are used here are:

  1. data: dict or list of dicts :

    • Unserialized JSON objects.
  2. record_path: str or list of str : default None

    • Path in each object to list of records. If not passed, data will be assumed to be an array of records.
  3. meta: list of paths (str or list of str) : default None

    • Fields to use as metadata for each record in resulting table.
  4. record_prefix: str : default None

    • If True, prefix records with dotted (?) path, e.g. foo.bar.field if path to records is ['foo', 'bar'].
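A tiny illustration of how these parameters interact, using hypothetical data:

```python
import pandas as pd

# 'items' is the record_path (the list to expand into rows);
# 'owner' is meta (carried onto every resulting row);
# record_prefix distinguishes record columns from meta columns.
records = [{"items": [{"name": "a"}, {"name": "b"}], "owner": "x"}]

df = pd.json_normalize(records, record_path='items', meta=['owner'],
                       record_prefix='item_')
print(df.columns.tolist())  # ['item_name', 'owner']
```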

Upvotes: 0
