Reputation: 1712
Let's say I have the following DataFrame, where the data
column contains a nested JSON string that I want to parse into separate columns:
import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}'
    ]},
    index=['Alice', 'Alice', 'Bob', 'Carol']
)
df
I've found the json_normalize function and am currently parsing the JSON in a list comprehension. The result is correct, but this is slow: 1,000 rows take 1-2 seconds, and my real run has about a million rows:
import json
from pandas.io.json import json_normalize
parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['uid']
parsed_df
Is there a faster way to parse this data into a nice-looking DataFrame?
Upvotes: 15
Views: 18041
Reputation: 516
Assuming the JSON data is available in one big chunk rather than split into individual strings, you can call json.loads once, iterate through the result building one flat dict per record, and finally create a DataFrame from the list of dicts. This works well: with 150,000 rows, 30 original columns, and 6 columns to be extracted into a new DataFrame, it completes in less than a second.
For example:
x = json.loads('''[
{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},
{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0}, "fees": {"monthly": 0}}, "user_name": "Alice"},
{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"},
{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Carol"}]''')

# flatten the nested keys by hand, one dict per record
load_items = []
for item in x:
    load_items.append({
        'uid': item['uid'],
        'account_type': item['account_type'],
        'currency_current': item['account_data']['currency']['current'],
        'currency_minimum': item['account_data']['currency']['minimum'],
        'fees_monthly': item['account_data']['fees']['monthly'],
        'user_name': item['user_name'],
    })

y = pd.DataFrame(load_items)
y
| index | uid | account_type | currency_current | currency_minimum | fees_monthly | user_name |
|---|---|---|---|---|---|---|
| 0 | 100 | 1 | 1000 | -500 | 13.5 | Alice |
| 1 | 100 | 2 | 2000 | 0 | 0.0 | Alice |
| 2 | 200 | 1 | 3000 | 0 | 13.5 | Bob |
| 3 | 300 | 1 | 4000 | 0 | 13.5 | Carol |
(*thanks to Jupyter/Colab for the markdown table*)
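If, as in the question, each row instead carries its own JSON string rather than one big chunk, the same hand-flattening idea can be applied row by row. A minimal sketch against the question's df (the names rows and parsed_df are introduced here for illustration):

# sketch: one json.loads per row, flattening by hand as above
rows = []
for js in df['data']:
    item = json.loads(js)
    rows.append({
        'uid': item['uid'],
        'account_type': item['account_type'],
        'currency_current': item['account_data']['currency']['current'],
        'currency_minimum': item['account_data']['currency']['minimum'],
        'fees_monthly': item['account_data']['fees']['monthly'],
        'user_name': item['user_name'],
    })

parsed_df = pd.DataFrame(rows, index=df.index)
parsed_df['bank_account'] = df['bank_account'].values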
Upvotes: 0
Reputation: 29
To also get the column names for the DataFrame automatically, derive them once from the first parsed row:

first_cols = json_normalize(json.loads(df['data'].iloc[0])).columns.tolist()
parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                         columns=first_cols, index=df.index)
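On recent pandas versions there is a shorter route; a sketch assuming pandas >= 1.0, where json_normalize is exposed as pd.json_normalize and can take the whole list of parsed records in a single call, naming the columns automatically:

# sketch, assuming pandas >= 1.0: one json_normalize call over all parsed records
parsed_df = pd.json_normalize([json.loads(js) for js in df['data']])
parsed_df.index = df.index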
Upvotes: 0
Reputation: 164683
I see a small (~25%) performance improvement from bypassing pandas.concat. Otherwise, rewriting / optimizing json_normalize doesn't seem straightforward.
def original(df):
    parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df

def jp(df):
    cols = ['account_data.currency.current', 'account_data.currency.minimum',
            'account_data.fees.monthly', 'account_type', 'uid', 'user_name']
    parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                             columns=cols)
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df

df = pd.concat([df]*100, ignore_index=True)

%timeit original(df)  # 675 ms per loop
%timeit jp(df)        # 526 ms per loop
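If json.loads itself dominates the profile, swapping in a faster parser is one more lever; a sketch assuming the third-party orjson package is installed (not benchmarked here):

import orjson

def jp_orjson(df):
    # identical to jp() above, but with orjson doing the string parsing
    cols = ['account_data.currency.current', 'account_data.currency.minimum',
            'account_data.fees.monthly', 'account_type', 'uid', 'user_name']
    parsed_df = pd.DataFrame([json_normalize(orjson.loads(js)).values[0] for js in df['data']],
                             columns=cols)
    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df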
Upvotes: 3