Accessing nested data in JSON file to build multiple dataframes

Question

I'm currently accessing a JSON file to gather information about companies, branches and workers. In this JSON there are multiple companies, in which every company has its own branches and each branch has its own workers. Below there's an example of the JSON structure I'm working on.

{
  "company": [
    {
      "companyName": "MyCompany",
      "branches": [
        {
          "branchName": "My First Branch",
          "workers": [
            {
              "workerName": "John",
              "wage": "10"
            },
            {
              "workerName": "Lucas",
              "wage": "20"
            }
          ]
        },
        {
          "branchName": "My Second Branch",
          "workers": [
            {
              "workerName": "Mary",
              "wage": "30"
            },
            {
              "workerName": "Jack",
              "wage": "40"
            }
          ]
        }
      ]
    },
    {
      "companyName": "YourCompany",
      "branches": [
        {
          "branchName": "Your First Branch",
          "workers": [
            {
              "workerName": "George",
              "wage": "15"
            },
            {
              "workerName": "Harry",
              "wage": "25"
            }
          ]
        },
        {
          "branchName": "Your Second Branch",
          "workers": [
            {
              "workerName": "Wayne",
              "wage": "35"
            },
            {
              "workerName": "Rose",
              "wage": "45"
            }
          ]
        }
      ]
    }
  ]
}

My goal is to gather informations regarding all companies, branches and workers in one dataframe for each of them. For that, I'm currently using loops, as shown below.

Companies = my_json['Companies'] #accessing list of companies
Branches = None
Workers = None
for i in range(len(Companies)):
  company_branches = Companies[i]['Branches'] #accessing branches for that company
  if(Branches is None):
    Branches = pd.DataFrame(company_branches)
  else:
    Branches = pd.concat([Branches,pd.DataFrame(company_branches)])
  for j in range(len(company_branches)):
    branch_workers = company_branches[j]['Workers'] #accessing workers for that branch
    if(Workers is None):
      Workers = pd.DataFrame(branch_workers)
    else:
      Workers = pd.concat([Workers,pd.DataFrame(branch_workers)])
Companies = pd.DataFrame(Companies)

This solves my problem, creating the three desired dataframes (with some extra columns for companies and branches which I would still drop), but I'm currently having performance issues. I've been trying to solve this problem without loops, but I can't create the dataframes correctly. If I try

Companies = pd.DataFrame(my_json['companies'])

it creates the dataframe correctly, but then if I try

Branches = pd.DataFrame(Companies.branches.values)

it isn't created correctly. It doesn't raise an error, but it's basically copying the column from the Companies dataframe, creating a single column with the JSON code for the branches for each company. In this case, what I wanted was as many columns as the amount of attributes the branches have.

Any tips on how to solve this problem in an efficient way?

Alexander · Accepted Answer

You should probably keep all of your data in one dataframe (tidy data). If you need, you could then get a dataframe for companies or branches, e.g. pd.DataFrame(df.company.unique(), columns='company name').

The method below uses a nested list comprehension to flatten your data. It also converts the wage record from a string to a float.

df = pd.DataFrame(
    [(company.get('companyName'), branch.get('branchName'), 
      worker.get('workerName'), float(worker.get('wage', 0))) 
     for company in my_json['company']
     for branch in company['branches'] 
     for worker in branch.get('workers')
], columns=['company', 'branch', 'worker', 'wage'])

>>> df
       company              branch  worker  wage
0    MyCompany     My First Branch    John  10.0
1    MyCompany     My First Branch   Lucas  20.0
2    MyCompany    My Second Branch    Mary  30.0
3    MyCompany    My Second Branch    Jack  40.0
4  YourCompany   Your First Branch  George  15.0
5  YourCompany   Your First Branch   Harry  25.0
6  YourCompany  Your Second Branch   Wayne  35.0
7  YourCompany  Your Second Branch    Rose  45.0

Accessing nested data in JSON file to build multiple dataframes

Answers (2)

Related Questions