Reputation: 33
I'm currently accessing a JSON file to gather information about companies, branches and workers. In this JSON there are multiple companies, in which every company has its own branches and each branch has its own workers. Below there's an example of the JSON structure I'm working on.
{
"company": [
{
"companyName": "MyCompany",
"branches": [
{
"branchName": "My First Branch",
"workers": [
{
"workerName": "John",
"wage": "10"
},
{
"workerName": "Lucas",
"wage": "20"
}
]
},
{
"branchName": "My Second Branch",
"workers": [
{
"workerName": "Mary",
"wage": "30"
},
{
"workerName": "Jack",
"wage": "40"
}
]
}
]
},
{
"companyName": "YourCompany",
"branches": [
{
"branchName": "Your First Branch",
"workers": [
{
"workerName": "George",
"wage": "15"
},
{
"workerName": "Harry",
"wage": "25"
}
]
},
{
"branchName": "Your Second Branch",
"workers": [
{
"workerName": "Wayne",
"wage": "35"
},
{
"workerName": "Rose",
"wage": "45"
}
]
}
]
}
]
}
My goal is to gather informations regarding all companies, branches and workers in one dataframe for each of them. For that, I'm currently using loops, as shown below.
Companies = my_json['Companies'] #accessing list of companies
Branches = None
Workers = None
for i in range(len(Companies)):
company_branches = Companies[i]['Branches'] #accessing branches for that company
if(Branches is None):
Branches = pd.DataFrame(company_branches)
else:
Branches = pd.concat([Branches,pd.DataFrame(company_branches)])
for j in range(len(company_branches)):
branch_workers = company_branches[j]['Workers'] #accessing workers for that branch
if(Workers is None):
Workers = pd.DataFrame(branch_workers)
else:
Workers = pd.concat([Workers,pd.DataFrame(branch_workers)])
Companies = pd.DataFrame(Companies)
This solves my problem, creating the three desired dataframes (with some extra columns for companies and branches which I would still drop), but I'm currently having performance issues. I've been trying to solve this problem without loops, but I can't create the dataframes correctly. If I try
Companies = pd.DataFrame(my_json['companies'])
it creates the dataframe correctly, but then if I try
Branches = pd.DataFrame(Companies.branches.values)
it isn't created correctly. It doesn't raise an error, but it's basically copying the column from the Companies dataframe, creating a single column with the JSON code for the branches for each company. In this case, what I wanted was as many columns as the amount of attributes the branches have.
Any tips on how to solve this problem in an efficient way?
Upvotes: 2
Views: 666
Reputation: 109526
You should probably keep all of your data in one dataframe (tidy data). If you need, you could then get a dataframe for companies or branches, e.g. pd.DataFrame(df.company.unique(), columns='company name')
.
The method below uses a nested list comprehension to flatten your data. It also converts the wage
record from a string to a float.
df = pd.DataFrame(
[(company.get('companyName'), branch.get('branchName'),
worker.get('workerName'), float(worker.get('wage', 0)))
for company in my_json['company']
for branch in company['branches']
for worker in branch.get('workers')
], columns=['company', 'branch', 'worker', 'wage'])
>>> df
company branch worker wage
0 MyCompany My First Branch John 10.0
1 MyCompany My First Branch Lucas 20.0
2 MyCompany My Second Branch Mary 30.0
3 MyCompany My Second Branch Jack 40.0
4 YourCompany Your First Branch George 15.0
5 YourCompany Your First Branch Harry 25.0
6 YourCompany Your Second Branch Wayne 35.0
7 YourCompany Your Second Branch Rose 45.0
Upvotes: 1
Reputation: 18367
This is my solution to the problem in a generalized way. Given I define the whole json
your provide as data
, then:
to_df = {'companyName':[],'branchName':[],'workerName':[],'wage':[]}
for i in range(len(data['company'])):
for j in range(len(data['company'][i]['branches'])):
for k in range(len(data['company'][i]['branches'][j]['workers'])):
to_df['companyName'].append(data['company'][i]['companyName'])
to_df['branchName'].append(data['company'][i]['branches'][j]['branchName'])
to_df['workerName'].append(data['company'][i]['branches'][j]['workers'][k]['workerName'])
to_df['wage'].append(data['company'][i]['branches'][j]['workers'][k]['wage'])
df = pd.DataFrame(to_df)
print(df)
Output:
companyName branchName workerName wage
0 MyCompany My First Branch John 10
1 MyCompany My First Branch Lucas 20
2 MyCompany My Second Branch Mary 30
3 MyCompany My Second Branch Jack 40
4 YourCompany Your First Branch George 15
5 YourCompany Your First Branch Harry 25
6 YourCompany Your Second Branch Wayne 35
7 YourCompany Your Second Branch Rose 45
You can check this answer for some extra information regarding nested jsons: How to extract nested JSON data?
Upvotes: 0