Reputation: 1712
Let's say I have the following DataFrame, where the data
column contains a nested JSON string that I want to parse into separate columns:
import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
    ]},
    index=['Alice', 'Alice', 'Bob', 'Carol']
)
df
I've found the json_normalize function and am currently parsing the JSON in a list comprehension. The result is correct, but this is slow: 1,000 rows take 1-2 seconds, and my real run has about a million rows:
import json
from pandas.io.json import json_normalize
parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['uid']
parsed_df
Is there a faster way to parse this data into a nice-looking DataFrame?
Upvotes: 15
Views: 18041
Reputation: 516
Assuming the JSON data is available in one big chunk rather than split into individual strings, you can call json.loads once, iterate through the result building one flat dict per record, and finally create a DataFrame from the list of dicts. This works well: with 150,000 rows, 30 original columns, and 6 columns to be extracted into a new DataFrame, it completes in less than a second.
For example:
x = json.loads('''[
{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},
{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"},
{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"},
{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}]''')

# flatten the nested keys by hand, one dict per record
load_items = []
for item in x:
    load_items.append({
        'uid': item['uid'],
        'account_type': item['account_type'],
        'currency_current': item['account_data']['currency']['current'],
        'currency_minimum': item['account_data']['currency']['minimum'],
        'fees_monthly': item['account_data']['fees']['monthly'],
        'user_name': item['user_name'],
    })

y = pd.DataFrame(load_items)
y
| index | uid | account_type | currency_current | currency_minimum | fees_monthly | user_name |
|---|---|---|---|---|---|---|
| 0 | 100 | 1 | 1000 | -500 | 13.5 | Alice |
| 1 | 100 | 2 | 2000 | 0 | 0.0 | Alice |
| 2 | 200 | 1 | 3000 | 0 | 13.5 | Bob |
| 3 | 300 | 1 | 4000 | 0 | 13.5 | Carol |
(*thanks to Jupyter/Colab for the markdown table*)
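If, as in the question, each row instead carries its own JSON string rather than one big chunk, the same hand-flattening idea can be applied row by row. A minimal sketch against the question's df (the names rows and parsed_df are introduced here for illustration):

# sketch: one json.loads per row, flattening by hand as above
rows = []
for js in df['data']:
    item = json.loads(js)
    rows.append({
        'uid': item['uid'],
        'account_type': item['account_type'],
        'currency_current': item['account_data']['currency']['current'],
        'currency_minimum': item['account_data']['currency']['minimum'],
        'fees_monthly': item['account_data']['fees']['monthly'],
        'user_name': item['user_name'],
    })

parsed_df = pd.DataFrame(rows, index=df.index)
parsed_df['bank_account'] = df['bank_account'].values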
Upvotes: 0
Reputation: 29
To also get the column names for the DataFrame automatically, derive them once from the first parsed row:

first_cols = json_normalize(json.loads(df['data'].iloc[0])).columns.tolist()
parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                         columns=first_cols, index=df.index)
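On recent pandas versions there is a shorter route; a sketch assuming pandas >= 1.0, where json_normalize is exposed as pd.json_normalize and can take the whole list of parsed records in a single call, naming the columns automatically:

# sketch, assuming pandas >= 1.0: one json_normalize call over all parsed records
parsed_df = pd.json_normalize([json.loads(js) for js in df['data']])
parsed_df.index = df.index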
Upvotes: 0
Reputation: 164683
I see a small (~25%) performance improvement from bypassing pandas.concat. Otherwise, rewriting / optimizing json_normalize doesn't seem straightforward.
def original(df):
    parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df

def jp(df):
    cols = ['account_data.currency.current', 'account_data.currency.minimum',
            'account_data.fees.monthly', 'account_type', 'uid', 'user_name']
    parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                             columns=cols)
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df

df = pd.concat([df]*100, ignore_index=True)

%timeit original(df)  # 675 ms per loop
%timeit jp(df)        # 526 ms per loop
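If json.loads itself dominates the profile, swapping in a faster parser is one more lever; a sketch assuming the third-party orjson package is installed (not benchmarked here):

import orjson

def jp_orjson(df):
    # identical to jp() above, but with orjson doing the string parsing
    cols = ['account_data.currency.current', 'account_data.currency.minimum',
            'account_data.fees.monthly', 'account_type', 'uid', 'user_name']
    parsed_df = pd.DataFrame([json_normalize(orjson.loads(js)).values[0] for js in df['data']],
                             columns=cols)
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df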
Upvotes: 3