Reputation: 57
i received single JSON's (500 JSON's) and modified it by adding them to the end of an existing list with the append() method.
d_path = r'--PATH HERE--'
d_files = [f for f in listdir(d_path) if isfile(join(d_path,f))]
n = num_data
d_dicts=[]
for counter,d_file in enumerate(d_files):
with open(d_path+'\\'+d_file,encoding="utf8") as json_data:
d_dicts.append(json.load(json_data))
if counter == num_data:
break
After this step i tried using json_normalize to normalize the JSON data into a flat table (Pandas DataFrame with a total of 500 rows).
df = json_normalize(d)
Additional Info:
class 'pandas.core.frame.DataFrame'
dtypes: float64(8), int64(3), object(9)
So far it worked out nicely except for one column. I ended up having one column with a List of dictionaries in each row. I tried to look for a solution but I can't find one that helps me. Each row has a Nested dictionary.
Here is an example of three rows of the column named Info_column with fictional data but the same structure:
Info_column
[{**'Greeting':** 'Good day', 'Group': '1.2', 'Window': None,
'Value1': 17.0, 'Value2': 13.23, 'Value3': 11.0,
'Date1': '2013-09-04', 'Date2': '2012-09-05', 'Date3': '2015-07-22',
'Married': False, 'Country': None,
'Person': [{'Age': '25', 'Number': '82', 'Value4': 19.2,
'Column1': None, 'Column2': None, 'Column3': None, 'Column4': None}]}]
[{'Greeting': 'Good afternoon', 'Group': '1.4', 'Window': None,
'Value1': 12.0, 'Value2': 9.23, 'Value3': 2.0,
'Date1': '2016-09-04', 'Date2': '2016-09-16', 'Date3': '2016-07-05',
'Married': True, 'Country': Germany,
'Person': [{'Age': '30', 'Number': '9', 'Value4': 10.0,
'Column1': None, 'Column2': None, 'Column3': None, 'Column4': None}]}]
[{'Greeting': 'Good evening', 'Group': '3.0', 'Window': True,
'Value1': 24.0, 'Value2': 15.5, 'Value3': 2.0,
'Date1': '2019-02-01', 'Date2': '2019-05-05', 'Date3': '2018-05-03',
'Married': False, 'Country': Spain,
'Person': [{'Age': '24', 'Number': '12', 'Value4': 8.2,
'Column1': None, 'Column2': None, 'Column3': None, 'Column4': None}]}]
What is the correct way?
My goal is to have the Information for every row in this column as additional columns in my dataframe.
Columns that I need as additional columns next to the other columns in my DataFrame df:
Greeting, Group, Window, Value1, Value2, Value3, Date1, Date2, Date3, Married, Country, Person_Age, Person_Number, Person_Value4, Person_Column1, Person_Column2, Person_Column3, Person_Column4
Thanks a lot for your help
Regards, Elle
Upvotes: 3
Views: 4921
Reputation: 5502
You can try the following approach:
def f(x):
d = {}
# Each element of the dict
for k,v in x.items():
# Check value type
if isinstance(v,list):
# If list: iter sub dict
for k_s, v_s in v[0].items():
d["{}_{}".format(k, k_s)] = v_s
else: d[k] = v
return pd.Series(d)
out = df.join(df["Info_column"].apply(f))\
.drop("Info_column", axis=1)
Explanations:
All the question is about flattening the Info_column
. To do that, we define a flatten function: "flatten"
. It does the following:
list
:
Apply the flatten
function to the Info_column
using apply
Join the current dataframe with output from previous step using join
Remove Info_column
using drop with axis=1
.
Full Code + illustration:
# Create dummy dataset with 3 columns
data = [["a", 1, {'Greeting': 'Good day', 'Group': '1.2', 'Window': None,
'Value1': 17.0, 'Value2': 13.23, 'Value3': 11.0,
'Date1': '2013-09-04', 'Date2': '2012-09-05', 'Date3': '2015-07-22',
'Married': False, 'Country': None,
'Person': [{'Age': '25', 'Number': '82', 'Value4': 19.2,
'Column1': None, 'Column2': None, 'Column3': None, 'Column4': None}]}],
["b", 5, {'Greeting': 'Good afternoon', 'Group': '1.4', 'Window': None,
'Value1': 12.0, 'Value2': 9.23, 'Value3': 2.0,
'Date1': '2016-09-04', 'Date2': '2016-09-16', 'Date3': '2016-07-05',
'Married': True, 'Country': "Germany",
'Person': [{'Age': '30', 'Number': '9', 'Value4': 10.0,
'Column1': None, 'Column2': None, 'Column3': None, 'Column4': None}]}],
["c", 2, {'Greeting': 'Good evening', 'Group': '3.0', 'Window': True,
'Value1': 24.0, 'Value2': 15.5, 'Value3': 2.0,
'Date1': '2019-02-01', 'Date2': '2019-05-05', 'Date3': '2018-05-03',
'Married': False, 'Country': "Spain",
'Person': [{'Age': '24', 'Number': '12', 'Value4': 8.2,
'Column1': None, 'Column2': None, 'Column3': None, 'Column4': None}]}]]
df = pd.DataFrame(data, columns=["colA", "colB", "Info_column"])
print(df)
# colA colB Info_column
# 0 a 1 {'Greeting': 'Good day', 'Group': '1.2', 'Wind...
# 1 b 5 {'Greeting': 'Good afternoon', 'Group': '1.4',...
# 2 c 2 {'Greeting': 'Good evening', 'Group': '3.0', '...
# Step 1
def flatten(x):
d = {}
# Each element of the dict
for k,v in x.items():
# Check value type
if isinstance(v,list):
# If list: iter sub dict
for k_s, v_s in v[0].items():
d["{}_{}".format(k, k_s)] = v_s
else: d[k] = v
return pd.Series(d)
# Step 2
print(df["Info_column"].apply(flatten))
# Greeting Group Window Value1 Value2 Value3 ... Person_Number Person_Value4 Person_Column1 Person_Column2 Person_Column3 Person_Column4
# 0 Good day 1.2 None 17.0 13.23 11.0 ... 82 19.2 None None None None
# 1 Good afternoon 1.4 None 12.0 9.23 2.0 ... 9 10.0 None None None None
# 2 Good evening 3.0 True 24.0 15.50 2.0 ... 12 8.2 None None None None
# [3 rows x 18 columns]
# Step 3
print(df.join(df["Info_column"].apply(flatten)))
# colA colB Info_column Greeting ... Person_Column1 Person_Column2 Person_Column3 Person_Column4
# 0 a 1 {'Greeting': 'Good day', 'Group': '1.2', 'Wind... Good day ... None None None None
# 1 b 5 {'Greeting': 'Good afternoon', 'Group': '1.4',... Good afternoon ... None None None None
# 2 c 2 {'Greeting': 'Good evening', 'Group': '3.0', '... Good evening ... None None None None
# [3 rows x 21 columns]
# Step 4
out = df.join(df["Info_column"].apply(flatten)).drop("Info_column", axis=1)
print(out)
# colA colB Greeting Group Window Value1 ... Person_Number Person_Value4 Person_Column1 Person_Column2 Person_Column3 Person_Column4
# 0 a 1 Good day 1.2 None 17.0 ... 82 19.2 None None None None
# 1 b 5 Good afternoon 1.4 None 12.0 ... 9 10.0 None None None None
# 2 c 2 Good evening 3.0 True 24.0 ... 12 8.2 None None None None
# [3 rows x 20 columns]
print(out.columns)
# Index(['colA', 'colB', 'Greeting', 'Group', 'Window', 'Value1', 'Value2',
# 'Value3', 'Date1', 'Date2', 'Date3', 'Married', 'Country', 'Person_Age',
# 'Person_Number', 'Person_Value4', 'Person_Column1', 'Person_Column2',
# 'Person_Column3', 'Person_Column4'],
# dtype='object')
Upvotes: 5