A.Ali

Reputation: 749

Using Pandas to parse a JSON column w/nested values in a huge CSV

I have a huge CSV file (3.5GB and growing every day) which has normal values and one column called 'Metadata' with nested JSON values. My script is below, and the intention is simply to convert the JSON column into normal columns, one for each of its key-value pairs. I am using Python 3 (Anaconda; Windows).

import pandas as pd
import numpy as np
import csv
import json  # needed for the json.loads converter below
import datetime as dt

from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c', 
    dayfirst=True, 
    encoding='utf-8', 
    header=0,
    nrows=10,
    chunksize=2,
    converters={'Metadata':json.loads}):

    ## parsing code comes here

    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')

And the column has JSON in the following format:

{  
   "content_id":"xxxx",
   "parental":"F",
   "my_custom_data":{  
      "GroupId":"NA",
      "group":null,
      "userGuid":"xxxxxxxxxxxxxx",
      "deviceGuid":"xxxxxxxxxxxxx",
      "connType":"WIFI",
      "channelName":"VOD",
      "assetId":"xxxxxxxxxxxxx",
      "GroupName":"NA",
      "playType":"VOD",
      "appVersion":"2.1.0",
      "userEnvironmentContext":"",
      "vodEncode":"H.264",
      "language":"English"
   }
}

The desired output is to have all the above key-value pairs as columns. The dataframe will have other non-JSON columns to which I need to add the columns parsed from the above JSON. I tried json_normalize but I am not sure how to apply json_normalize to a Series object and then convert it (or explode it) into multiple columns.

Upvotes: 3

Views: 6224

Answers (1)

Martijn Pieters

Reputation: 1122172

Just use json_normalize() on the series directly, and then use pandas.concat() with axis=1 to merge the new dataframe column-wise with the existing dataframe:

pd.concat([df, json_normalize(df['Metadata'])], axis=1)

You can add a .drop('Metadata', axis=1) if you no longer need the old column with the JSON data structure in it.
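Put together, a minimal runnable sketch (with a toy two-row dataframe standing in for the real CSV data; modern pandas also exposes the function as pd.json_normalize):

```python
import json
import pandas as pd

# Toy data shaped like the question's JSON, not the asker's real file.
raw = pd.DataFrame({
    'title': ['a', 'b'],
    'Metadata': [
        json.dumps({'content_id': '1', 'my_custom_data': {'language': 'English'}}),
        json.dumps({'content_id': '2', 'my_custom_data': {'language': 'French'}}),
    ],
})
# read_csv's converters={'Metadata': json.loads} does this step for you.
raw['Metadata'] = raw['Metadata'].map(json.loads)

# Flatten each dict into columns; nested keys get dotted names.
flat = pd.json_normalize(raw['Metadata'])

# axis=1 glues the new columns alongside the old ones (index-aligned);
# drop() discards the now-redundant JSON column.
out = pd.concat([raw, flat], axis=1).drop('Metadata', axis=1)
print(sorted(out.columns))
# ['content_id', 'my_custom_data.language', 'title']
```

Note that concat aligns on the index, so this works cleanly as long as the json_normalize result and the chunk share the same default RangeIndex.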

The columns produced for the my_custom_data nested dictionary will have my_custom_data. prefixed. If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename() operation:

json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
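For example, with toy column names standing in for json_normalize output (using len(prefix) rather than a hard-coded 15 keeps the slice honest):

```python
import pandas as pd

# Hypothetical flattened columns, as json_normalize would produce them.
flat = pd.DataFrame({
    'content_id': ['1'],
    'my_custom_data.language': ['English'],
    'my_custom_data.playType': ['VOD'],
})

prefix = 'my_custom_data.'
# Strip the prefix only where it is present; leave other names untouched.
renamed = flat.rename(
    columns=lambda n: n[len(prefix):] if n.startswith(prefix) else n)
print(sorted(renamed.columns))
# ['content_id', 'language', 'playType']
```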

If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json), then you want to use Series.apply() to process each value and return each resulting dictionary as a pandas.Series() object:

def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)

You can then concatenate the result of the Series.apply() call (which will be a dataframe) back onto your original dataframe:

pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
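A self-contained sketch of that pattern, with a tiny hand-rolled flattener standing in for flatten_json (the real library offers more options, such as a configurable separator):

```python
import pandas as pd

def flatten(d, parent=''):
    """Hypothetical stand-in for flatten_json: joins nested keys with '_'."""
    out = {}
    for k, v in d.items():
        key = f'{parent}{k}'
        if isinstance(v, dict):
            out.update(flatten(v, key + '_'))
        else:
            out[key] = v
    return out

def some_conversion_function(dictionary):
    # Returning a Series makes apply() produce one dataframe row per dict.
    return pd.Series(flatten(dictionary))

df = pd.DataFrame({
    'Metadata': [
        {'content_id': '1', 'my_custom_data': {'language': 'English'}},
        {'content_id': '2', 'my_custom_data': {'language': 'French'}},
    ],
})
parsed = df['Metadata'].apply(some_conversion_function)  # a DataFrame
result = pd.concat([df, parsed], axis=1)
print(sorted(parsed.columns))
# ['content_id', 'my_custom_data_language']
```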

Upvotes: 3
