user2355903
user2355903

Reputation: 643

Parsing Nested JSON to one data file

I am trying to parse a nested json.

I've got the dataset stored here so that you can see what I'm seeing specifically if you want: https://mega.nz/file/YWNSRBjK#V9DpoY5LSp-VL8Mnu7NEfNf3FhDOCj9FHBiTQ4KHEa8

I am attempting to parse this using pandas json_normalize function. Below is what my code looks like in it's entirety.

import gzip   
import shutil
import json
import pandas as pd

with gzip.open('testjson.json.gz', 'rb') as f_in:
    with open('unzipped_json.json', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

f = open('unzipped_json.json')
data = json.load(f)
keys = data.keys()
keys_string = list(keys)
 
### In Network
in_network_df = pd.json_normalize(data['in_network'])

### Negotiated Rates
negotiated_rates_df = pd.json_normalize(data=data['in_network'],
                                        record_path=("negotiated_rates"))
negotiated_rates_df = negotiated_rates_df.explode('provider_references')
negotiated_rates_df = negotiated_rates_df.explode('negotiated_prices')

### Negotiated Prices
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=[
                                             #['negotiated_rates','provider_references'],
                                            # ['negotiation_arrangement', 'name','billing_code_type','billing_code','description']
                                             ],
                                        record_path=['negotiated_rates','negotiated_prices'],
                                        errors='ignore')
negotiated_prices_df = negotiated_prices_df.explode('service_code')

### Provider References
provider_references_df = pd.json_normalize(data['provider_references'])
provider_references_test = provider_references_df.explode('provider_groups')

### Provider Groups
provider_groups = pd.json_normalize(data=data['provider_references'],
                                    meta=['provider_group_id'],
                                        record_path=("provider_groups"))
provider_groups = provider_groups.explode('npi')

I am specifically having trouble with the negotiated prices part of this json object. I am trying to add in some data from parent objects, but it is giving me an error. To point out specifically what I would like to do here it is below.

negotiated_prices_df = pd.json_normalize(data=data['in_network'],
                                         meta=['provider_references'],
                                        record_path=['negotiated_rates','negotiated_prices'],
                                        errors='ignore')

When I try to do this I get ValueError: operands could not be broadcast together with shape (74607,) (24869,)

Can anyone help me understand what is going on here?

Edit: Trying to provide some more context in case someone is not wanting to open my file... Here is one spot showing the problematic portion I'm dealing with in the JSON. I can't seem to get the provider_references to attach to any of the child objects.

"provider_references":[261, 398, 799],"negotiated_prices":[{"negotiated_type": "fee schedule","negotiated_rate": 296.00,"expiration_date": "2023-06-30","service_code": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13",

Upvotes: 2

Views: 192

Answers (1)

Josh Friedlander
Josh Friedlander

Reputation: 11657

I think the code that you want looks like this:

with open('unzipped_json.json') as f:
    data = json.load(f)

negotiated_rates_and_prices_df = pd.json_normalize(
    data["in_network"],
    record_path=["negotiated_rates", ["negotiated_prices"]],
    meta=[
        "negotiation_arrangement",
        "name",
        "billing_code_type",
        "billing_code_type_version",
        "billing_code",
        "description",
        ["negotiated_rates", "provider_references"],
    ],
)

That takes care of the in_network part of the JSON. The trick is that within the metadata path you want to put the columns which are not nested in a regular list, and the nested ones in the order of nesting (ie ["negotiated_rates", "provider_references"]). There's a similar example in the docs here.

Then for the other nested part of the JSON you can do this:

provider_references_df = pd.json_normalize(
    data["provider_references"], "provider_groups", "provider_group_id"
)

And that takes care of the whole thing.

Upvotes: 2

Related Questions