Reputation: 388
I need to transform following data frame with json values in column into dataframe columnar structure so that it will be taking less space and easy to compute.
Sample DataFrame:
obs_id | date | obs |
---|---|---|
I2213 | 2021-12-31 23:20:02.761008 | "[{'type': 'air', 'results': {'bat': {'F1': 0.1, 'F2': 0.2}}, {'type': 'water', 'results': {'neo': {'F1': 0.3}}]" |
I2213 | 2022-01-01 23:20:02.761008 | "[{'type': 'earth', 'results': {'cat': {'F1': 0.4}}]" |
I2213 | 2022-01-02 23:20:02.761008 | "[{'type': 'air', 'results': {'bat': {'F1': 0.2, 'F2': 0.1}}]" |
Required Transformation format:
obs_id | date | obs.air.bat.F1 | obs.air.bat.F2 | obs.water.neo.F1 | obs.earth.cat.F1 |
---|
not sure if multi-level columns will suit better here.
I tried to create a separate dataframe from obs column like:
df1 = pd.DataFrame(df['obs'].values.tolist())
but since it contains list instead of dictionary, it doesn't work. Is it possible to achieve the require format?
Upvotes: 1
Views: 4071
Reputation: 30971
Let's start from a correction to your source data. As your sample contains unbalanced parentheses, probably its actual content should be:
obs_id date obs
0 I2213 2021-12-31 [{'type': 'air', 'results': {'bat': {'F1': 0.1, 'F2': 0.2}}}, {'type': 'water', 'results': {'neo': {'F1': 0.3}}}]
1 I2213 2022-01-01 [{'type': 'earth', 'results': {'cat': {'F1': 0.4}}}]
2 I2213 2022-01-02 [{'type': 'air', 'results': {'bat': {'F1': 0.2, 'F2': 0.1}}}]
To keep printouts of reasonable breadth, I dropped the time part from your date column.
Start the coding part from necessary imports:
import pandas as pd
import json
Then define a row processing function as:
def procRow(row):
wrk1 = pd.json_normalize(json.loads(row.obs.replace("'", '"')))
wrk2 = wrk1.set_index('type').stack().reset_index()
return pd.Series(wrk2[0].values, index='obs.' + wrk2.type\
+ wrk2.level_1.str.slice(7))
And concatenate first 2 columns of df with the result of application of this function to each row:
result = pd.concat([df.iloc[:, 0:2], df.apply(procRow, axis=1)], axis=1)
The result is:
obs_id date obs.air.bat.F1 obs.air.bat.F2 obs.earth.cat.F1 obs.water.neo.F1
0 I2213 2021-12-31 0.1 0.2 NaN 0.3
1 I2213 2022-01-01 NaN NaN 0.4 NaN
2 I2213 2022-01-02 0.2 0.1 NaN NaN
Upvotes: 1