Reputation: 19
I need to flatten a JSON with different levels of nested JSON arrays in Python
Part of my JSON looks like:
{
  "data": {
    "workbooks": [
      {
        "projectName": "TestProject",
        "name": "wkb1",
        "site": {
          "name": "site1"
        },
        "description": "",
        "createdAt": "2020-12-13T15:38:58Z",
        "updatedAt": "2020-12-13T15:38:59Z",
        "owner": {
          "name": "user1",
          "username": "John"
        },
        "embeddedDatasources": [
          {
            "name": "DS1",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS1",
                "hasExtracts": false,
                "owner": {
                  "username": "user2"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_1",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_2",
                "schema": "schema_2",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_3",
                "schema": "schema_3",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          },
          {
            "name": "DS2",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS2",
                "hasExtracts": false,
                "owner": {
                  "username": "user3"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_4",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          }
        ]
      }
    ]
  }
}
The output should look like this:
I tried using json_normalize but couldn't make it work. Currently I'm parsing it by looping over the nested arrays and reading values by key. I'm looking for a better way to normalize the JSON.
Upvotes: -3
Views: 2011
Reputation: 17
tl;dr: Your final output, along with the detailed steps, is shown below.
Details:
To answer this question you need a thorough understanding of pandas.json_normalize (in particular its record_path and meta parameters), the DataFrame explode method, and JSON parsing in general.
import json
import pandas as pd
data = {
    "data": {
        "workbooks": [
            {
                "projectName": "TestProject",
                "name": "wkb1",
                "site": {
                    "name": "site1"
                },
                "description": "",
                "createdAt": "2020-12-13T15:38:58Z",
                "updatedAt": "2020-12-13T15:38:59Z",
                "owner": {
                    "name": "user1",
                    "username": "John"
                },
                "embeddedDatasources": [
                    {
                        "name": "DS1",
                        "hasExtracts": False,
                        "upstreamDatasources": [
                            {
                                "projectName": "Data Sources",
                                "name": "DS1",
                                "hasExtracts": False,
                                "owner": {
                                    "username": "user2"
                                }
                            }
                        ],
                        "upstreamTables": [
                            {
                                "name": "table_1",
                                "schema": "schema_1",
                                "database": {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            },
                            {
                                "name": "table_2",
                                "schema": "schema_2",
                                "database": {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            },
                            {
                                "name": "table_3",
                                "schema": "schema_3",
                                "database": {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            }
                        ]
                    },
                    {
                        "name": "DS2",
                        "hasExtracts": False,
                        "upstreamDatasources": [
                            {
                                "projectName": "Data Sources",
                                "name": "DS2",
                                "hasExtracts": False,
                                "owner": {
                                    "username": "user3"
                                }
                            }
                        ],
                        "upstreamTables": [
                            {
                                "name": "table_4",
                                "schema": "schema_1",
                                "database": {
                                    "name": "testdb",
                                    "connectionType": "redshift"
                                }
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
First you need to drill down to the list of workbook dicts:
data_list = data['data']['workbooks']
I did some data massaging, renaming a few columns as per the requirements. A single rename call with one mapping dict does the job:
data_list_pd = pd.DataFrame(data_list)
data_list_pd = data_list_pd.rename(
    columns={'name': 'wkb',
             'createdAt': 'wkb_createdDt',
             'updatedAt': 'wkb_updatedDt',
             'projectName': 'prj'},
    errors='ignore')
data_list_pd
data_list = json.loads(data_list_pd.to_json(orient="records"))
data_list
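The DataFrame-to-records round trip used above can be illustrated on a toy frame (hypothetical data, not the workbook JSON); note that nested dicts survive the trip, which is what lets json_normalize consume the result:

```python
import json
import pandas as pd

# Toy frame: one column holds a plain value, the other a nested dict
df = pd.DataFrame([{'wkb': 'wkb1', 'site': {'name': 'site1'}}])

# to_json(orient="records") serializes each row as a JSON object;
# json.loads turns it back into a plain list of dicts
records = json.loads(df.to_json(orient="records"))
```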
Next is where the core of your problem lies. You need to flatten the JSON by passing record_path, which is essentially the nested list you want to expand, along with meta, which lists the remaining metadata columns you want to keep. After that you need to explode the columns that still contain lists, which you can do by chaining the explode method a couple of times.
flattened_dataframe = pd.json_normalize(
    data_list,
    record_path='embeddedDatasources',
    meta=['prj', 'wkb', 'wkb_createdDt', 'wkb_updatedDt',
          ['site', 'name'], ['owner', 'name'], ['owner', 'username']],
    errors='ignore').explode('upstreamDatasources').explode('upstreamTables')
flattened_dataframe
You can repeat this process a couple of times to reach your desired result. Since json_normalize works on JSON/dicts rather than DataFrames, you will have to convert the DataFrame back to JSON after each iteration. You can follow these steps:
flattened_json = json.loads(flattened_dataframe.to_json(orient="records"))
Also read about to_json.
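As a sketch of what such a second pass looks like (toy records standing in for flattened_json, not the answer's exact data): calling json_normalize with no record_path simply expands any remaining nested dicts into dotted column names.

```python
import pandas as pd

# Hypothetical records left over after the first pass, where the exploded
# 'upstreamTables' column still holds a dict per row
records = [
    {'name': 'DS1',
     'upstreamTables': {'name': 'table_1', 'schema': 'schema_1',
                        'database': {'name': 'testdb',
                                     'connectionType': 'redshift'}}},
]

# Without record_path, json_normalize flattens nested dicts into
# dot-separated columns (e.g. 'upstreamTables.database.name')
second_pass = pd.json_normalize(records)
```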
Upvotes: -1
Reputation: 455
Here's a partial solution:
First, save your data in the same directory as the script as a JSON file called data.json.
import json
import pandas as pd

with open('data.json') as json_file:
    json_data = json.load(json_file)

new_data = json_data['data']['workbooks']

# json_normalize now lives directly in the pandas namespace;
# the old pandas.io.json import is deprecated
result = pd.json_normalize(
    new_data,
    ['embeddedDatasources', 'upstreamTables'],
    ['projectName', 'name', 'createdAt', 'updatedAt', 'owner', 'site'],
    record_prefix='_')
result
Output:
|   | _name | _schema | _database.name | _database.connectionType | projectName | name | createdAt | updatedAt | owner | site |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | table_1 | schema_1 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
| 1 | table_2 | schema_2 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
| 2 | table_3 | schema_3 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
| 3 | table_4 | schema_1 | testdb | redshift | TestProject | wkb1 | 2020-12-13T15:38:58Z | 2020-12-13T15:38:59Z | {'name': 'user1', 'username': 'John'} | {'name': 'site1'} |
What next?
I think if you re-structure the data a bit in advance (for example, flattening 'database': {'name': 'testdb', 'connectionType': 'redshift'}) you will be able to add more fields to the meta parameter.
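One way to do that pre-flattening (a sketch, not part of the original answer; the toy frame mimics the partial result above) is to expand each remaining dict-valued column with json_normalize and join the flat columns back:

```python
import pandas as pd

# Hypothetical frame mimicking the partial result, where 'owner' and 'site'
# still hold dicts per row
result = pd.DataFrame([
    {'_name': 'table_1',
     'owner': {'name': 'user1', 'username': 'John'},
     'site': {'name': 'site1'}},
])

# Expand each dict column into flat columns ('owner.name', 'owner.username',
# 'site.name'), drop the original, and join the expansion back by row index
for col in ['owner', 'site']:
    expanded = pd.json_normalize(result[col].tolist()).add_prefix(col + '.')
    result = result.drop(columns=col).join(expanded)
```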
As you can see in the documentation of json_normalize, the four parameters used here are:
data: dict or list of dicts
record_path: str or list of str, default None
meta: list of paths (str or list of str), default None
record_prefix: str, default None
Upvotes: 0