Reputation: 749
I have a huge CSV file (3.5GB and getting bigger every day) which has normal values and one column called 'Metadata' with nested JSON values. My script is as below and the intention is simply to convert the JSON column into normal columns for each of its key-value pairs. I am using Python3 (Anaconda; Windows).
import pandas as pd
import numpy as np
import csv
import json
import datetime as dt
from pandas.io.json import json_normalize

for df in pd.read_csv("source.csv", engine='c',
                      dayfirst=True,
                      encoding='utf-8',
                      header=0,
                      nrows=10,
                      chunksize=2,
                      converters={'Metadata': json.loads}):
    ## parsing code comes here
    with open("output.csv", 'a', encoding='utf-8') as ofile:
        df.to_csv(ofile, index=False, encoding='utf-8')
And the column has JSON in the following format:
{
    "content_id": "xxxx",
    "parental": "F",
    "my_custom_data": {
        "GroupId": "NA",
        "group": null,
        "userGuid": "xxxxxxxxxxxxxx",
        "deviceGuid": "xxxxxxxxxxxxx",
        "connType": "WIFI",
        "channelName": "VOD",
        "assetId": "xxxxxxxxxxxxx",
        "GroupName": "NA",
        "playType": "VOD",
        "appVersion": "2.1.0",
        "userEnvironmentContext": "",
        "vodEncode": "H.264",
        "language": "English"
    }
}
The desired output is to have all of the above key-value pairs as columns. The dataframe will have other non-JSON columns, to which I need to add the columns parsed from the above JSON. I tried json_normalize
, but I am not sure how to apply it to a Series object and then convert (or explode) the result into multiple columns.
Upvotes: 3
Views: 6224
Reputation: 1122172
Just use json_normalize()
on the series directly, and then use pandas.concat()
to merge the new dataframe with the existing dataframe; pass axis=1 so the new columns are placed alongside the existing ones rather than appended as rows:
pd.concat([df, json_normalize(df['Metadata'])], axis=1)
You can add a .drop('Metadata', axis=1)
if you no longer need the old column with the JSON data structure in it.
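For illustration, here is a self-contained toy example of that pattern; the column names and values are made up, standing in for one parsed chunk of the real CSV (modern pandas exposes the function as pd.json_normalize; older versions import it from pandas.io.json):

```python
import pandas as pd

# Toy frame standing in for one chunk of the real CSV; the 'Metadata'
# column already holds parsed dicts (via the json.loads converter).
df = pd.DataFrame({
    'id': [1, 2],
    'Metadata': [
        {'content_id': 'a', 'my_custom_data': {'language': 'English'}},
        {'content_id': 'b', 'my_custom_data': {'language': 'French'}},
    ],
})

# Flatten the dicts into columns and glue them on side by side,
# then drop the original JSON column.
flat = pd.concat([df, pd.json_normalize(df['Metadata'])], axis=1)
flat = flat.drop('Metadata', axis=1)

print(flat.columns.tolist())
# ['id', 'content_id', 'my_custom_data.language']
```

Because json_normalize returns a dataframe with a fresh RangeIndex, this works as-is on a chunk whose index also starts at 0; if your chunk carries a different index, reset it first so concat aligns the rows correctly.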
The columns produced for the my_custom_data
nested dictionary will be prefixed with my_custom_data.
. If all the names in that nested dictionary are unique, you could drop that prefix with a DataFrame.rename()
operation (the slice n[15:] strips the 15-character 'my_custom_data.' prefix):
json_normalize(df['Metadata']).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)
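A quick sketch of the rename in action, again on made-up data (using pd.json_normalize, the modern spelling of the import):

```python
import pandas as pd

meta = pd.Series([
    {'content_id': 'a',
     'my_custom_data': {'language': 'English', 'playType': 'VOD'}},
])

# Strip the 'my_custom_data.' prefix (15 characters) from the
# nested-dict columns, leaving the other column names untouched.
flat = pd.json_normalize(meta).rename(
    columns=lambda n: n[15:] if n.startswith('my_custom_data.') else n)

print(flat.columns.tolist())
# ['content_id', 'language', 'playType']
```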
If you are using some other means to convert each dictionary value to a flattened structure (say, with flatten_json
), then you want to use Series.apply()
to process each value and then return each resulting dictionary as a pandas.Series()
object:
def some_conversion_function(dictionary):
    result = something_that_processes_dictionary_into_a_flat_dict(dictionary)
    return pd.Series(result)
You can then concatenate the result of the Series.apply()
call (which will be a dataframe) back onto your original dataframe, again along axis=1:
pd.concat([df, df['Metadata'].apply(some_conversion_function)], axis=1)
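Putting that together as a runnable sketch, with a small hypothetical flatten helper standing in for flatten_json (and toy data standing in for the real CSV):

```python
import pandas as pd

def flatten(d, prefix=''):
    # Hypothetical stand-in for flatten_json: recursively flattens
    # nested dicts into dot-separated keys.
    out = {}
    for k, v in d.items():
        key = f'{prefix}{k}'
        if isinstance(v, dict):
            out.update(flatten(v, key + '.'))
        else:
            out[key] = v
    return out

def some_conversion_function(dictionary):
    # Returning a Series makes Series.apply() produce a dataframe,
    # one column per flattened key.
    return pd.Series(flatten(dictionary))

df = pd.DataFrame({
    'id': [1],
    'Metadata': [{'content_id': 'a',
                  'my_custom_data': {'language': 'English'}}],
})

result = pd.concat([df, df['Metadata'].apply(some_conversion_function)],
                   axis=1)

print(result.columns.tolist())
# ['id', 'Metadata', 'content_id', 'my_custom_data.language']
```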
Upvotes: 3