Reputation: 33

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a bit but ran into a problem with specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109

I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.

Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!

import json
import pandas as pd

data = """[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()

TypeError: 'NoneType' object is not iterable

I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.

The desired output would look like this: Dataframe

Upvotes: 2

Answers (2)

Jamiu S.

Reputation: 5741

This approach can be more efficient when it comes to dealing with large datasets.

data = json.loads(data)
desired_data = list(
    map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
    if x["Ats"] is not None
    else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
    
df = pd.DataFrame(desired_data)

Output:

  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101

You might want to consider using this simple try and except approach when working with small datasets. In this case, whenever an error is found it should append new row to DataFrame with NAN.

Example:

data = json.loads(data)
df = pd.DataFrame()
for item in data:
    try:
        df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID")) 
    except TypeError:
        df = df.append({"ID" : item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)

print(df)

Output:

  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101

Upvotes: 0

Andrej Kesely

Reputation: 195653

Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:

import json

import pandas as pd

data = """\
[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""

data = json.loads(data)

df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")

df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)

Prints:

    ID Name     Desc
0  100  At1  Lazy At
1  101  NaN      NaN

Upvotes: 0

Normalizing json using pandas with inconsistent nested lists/dictionaries

Answers (2)

Related Questions