Reputation: 643
I am trying to parse a nested json.
I've got the dataset stored here so that you can see what I'm seeing specifically if you want: https://mega.nz/file/YWNSRBjK#V9DpoY5LSp-VL8Mnu7NEfNf3FhDOCj9FHBiTQ4KHEa8
I am attempting to parse this using pandas json_normalize function. Below is what my code looks like in it's entirety.
import gzip
import shutil
import json
import pandas as pd
with gzip.open('testjson.json.gz', 'rb') as f_in:
with open('unzipped_json.json', 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
f = open('unzipped_json.json')
data = json.load(f)
keys = data.keys()
keys_string = list(keys)
### In Network
in_network_df = pd.json_normalize(data['in_network'])
### Negotiated Rates
negotiated_rates_df = pd.json_normalize(data=data['in_network'],
record_path=("negotiated_rates"))
negotiated_rates_df = negotiated_rates_df.explode('provider_references')
negotiated_rates_df = negotiated_rates_df.explode('negotiated_prices')
### Negotiated Prices
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
meta=[
#['negotiated_rates','provider_references'],
# ['negotiation_arrangement', 'name','billing_code_type','billing_code','description']
],
record_path=['negotiated_rates','negotiated_prices'],
errors='ignore')
negotiated_prices_df = negotiated_prices_df.explode('service_code')
### Provider References
provider_references_df = pd.json_normalize(data['provider_references'])
provider_references_test = provider_references_df.explode('provider_groups')
### Provider Groups
provider_groups = pd.json_normalize(data=data['provider_references'],
meta=['provider_group_id'],
record_path=("provider_groups"))
provider_groups = provider_groups.explode('npi')
I am specifically having trouble with the negotiated prices part of this json object. I am trying to add in some data from parent objects, but it is giving me an error. To point out specifically what I would like to do here it is below.
negotiated_prices_df = pd.json_normalize(data=data['in_network'],
meta=['provider_references'],
record_path=['negotiated_rates','negotiated_prices'],
errors='ignore')
When I try to do this I get ValueError: operands could not be broadcast together with shape (74607,) (24869,)
Can anyone help me understand what is going on here?
Edit: Trying to provide some more context in case someone is not wanting to open my file... Here is one spot showing the problematic portion I'm dealing with in the JSON. I can't seem to get the provider_references to attach to any of the child objects.
"provider_references":[261, 398, 799],"negotiated_prices":[{"negotiated_type": "fee schedule","negotiated_rate": 296.00,"expiration_date": "2023-06-30","service_code": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13",
Upvotes: 2
Views: 192
Reputation: 11657
I think the code that you want looks like this:
with open('unzipped_json.json') as f:
data = json.load(f)
negotiated_rates_and_prices_df = pd.json_normalize(
data["in_network"],
record_path=["negotiated_rates", ["negotiated_prices"]],
meta=[
"negotiation_arrangement",
"name",
"billing_code_type",
"billing_code_type_version",
"billing_code",
"description",
["negotiated_rates", "provider_references"],
],
)
That takes care of the in_network
part of the JSON. The trick is that within the metadata path you want to put the columns which are not nested in a regular list, and the nested ones in the order of nesting (ie ["negotiated_rates", "provider_references"]
). There's a similar example in the docs here.
Then for the other nested part of the JSON you can do this:
provider_references_df = pd.json_normalize(
data["provider_references"], "provider_groups", "provider_group_id"
)
And that takes care of the whole thing.
Upvotes: 2