Michael

Reputation: 308

JSON object inside Pandas DataFrame

I have a JSON object inside a pandas dataframe column, which I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The new column names are built from two components: the keys in the dictionaries, and the substring after the last "=" in each dictionary's "key" value (so a dictionary with "key":"keyValue=1" and "valA":"8" produces a column valA_1).

This code works okay, but it is very slow when running on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something which is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.

# Import libraries 
import pandas as pd
from IPython.display import display # Used to display df's nicely in jupyter notebook.
import json

# Set some display options
pd.set_option('max_colwidth',150)

# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
 'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
  1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
  2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
  3: '[]',
  4: None}})
display(df)

# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()

# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i,df.columns.get_loc("ColB")]) and len(df.iloc[i,df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i,df.columns.get_loc("ColB")])
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        dfTemp = dfTemp.append(y)

# Merge the new dataframe back onto the original one as extra columns, with index matching the original dataframe
df = pd.merge(df,dfTemp, how = 'left', left_index = True, right_index = True)

print("Processed df:")
display(df)

Upvotes: 3

Views: 7357

Answers (1)

greg_data

Reputation: 2293

First, a general piece of advice about pandas. If you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.
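
As a quick toy illustration of the difference (made-up data, not the question's df), here is the same trivial transformation done row by row versus as a single column-wise operation:

import pandas as pd

toy = pd.DataFrame({'n': range(5)})

# Row-by-row: a Python-level loop with an index lookup on every iteration
doubled = []
for i in range(toy.shape[0]):
    doubled.append(toy.iloc[i, toy.columns.get_loc('n')] * 2)
toy['doubled_loop'] = doubled

# Column-at-a-time: one vectorised operation over the whole column
toy['doubled_vec'] = toy['n'] * 2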

With this in mind, we can re-write your current procedure using the pandas apply method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):

# Check whether record is null, or doesn't contain any real data
def do_the_thing(row):
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(row)
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]

        # we don't need to re-index
        # (the original gave the single record the same index number as the
        #  parent dataframe so the merge would work)
        # y.index = [df.index[i]]

        # we don't need to append to a temp dataframe either; just return the row
        return y.iloc[0]
    else:
        return pd.Series()
df2 = df.merge(df.ColB.apply(do_the_thing), how = 'left', left_index = True, right_index = True)

Notice that this returns exactly the same result as before; we haven't changed the logic. The apply method sorts out the indexes, so we can just merge on the index.
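
To illustrate what "sorts out the indexes" means (a standalone toy example, not the post's data): when the function passed to Series.apply returns a Series, pandas expands the results into a DataFrame whose index matches the original Series, so it lines up with df for the merge:

import pandas as pd

s = pd.Series(['a=1', 'b=2'], index=[10, 20])

def split_pair(v):
    left, right = v.split('=')
    return pd.Series({'left': left, 'right': right})

print(s.apply(split_pair))
#     left right
# 10     a     1
# 20     b     2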

I believe that answers your question in terms of speeding it up and being a bit more idiomatic.

I think you should consider, however, what you want to do with this data structure, and how you might structure it better.

Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. When you come to access these values, that will cause you pain, whatever the purpose is.

Are all entries in ColB important? Could you get away with just keeping the first one? Do you need to know the index of a certain valA value?

These are questions you should ask yourself, then decide on a structure which will allow you to do whatever analysis you need, without having to check a bunch of arbitrary things.
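
For example, one common alternative (just a sketch, assuming the example df from the question and a pandas version that has Series.explode) is to keep the data "long", with one row per dictionary, rather than an arbitrary number of wide columns:

import json
import pandas as pd

# Parse the JSON strings (skipping the null row), then expand to one dict per row
parsed = df['ColB'].dropna().apply(json.loads)
exploded = parsed.explode().dropna()      # the empty list in row 3 drops out here

# Turn the dicts into columns, keeping the original row index so you can join back
long_form = pd.DataFrame(exploded.tolist(), index=exploded.index)
long_form = df[['ColA']].join(long_form, how='inner')
print(long_form)

This keeps one row per dictionary and one column per dictionary field, which tends to be much easier to filter and aggregate than a wide layout with an arbitrary number of columns.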

Upvotes: 4
