extract specific field by index from json string stored in dataframe

Question

I have a dataframe that contains json style dictionaries per row, and I need to extract the fourth and eigth fields, or values from the second and fourth pairs to a new column i.e. 'a' for row 1, 'a' for 2 (corresponding to the Group) and '10786' for row 1, '38971' for row 2 (corresponding to Code). The expected output is below.

dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                          '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

expected output

                                                   a Group   Code
0  {"Number":1,"Group":"a","First":"Yes","Second"...     a  10786
1  {"Number":2,"Group":"a","First":"Yes","Second"...     a  38971

I have tried indexing by location but its printing only characters rather than fields e.g.

tuple(dat['a'].items())[0][1][4]

I also cannot appear to normalize the data with json_normalize, which I'm not sure why - perhaps the json string is stored suboptimally. So I am quite confused, and any tips would be great. Thanks so much

Luke Tonin · Accepted Answer

The reason pd.json_normalize is not working is because pd.json_normalize works on dictionaries and your strings are not json compatible.

Your strings are not json compatible because the "Labs" values contain multiple quotes which aren't escaped. It's possible to write a quick function to make your strings json compatible, then parse them as dictionaries, and finally use pd.json_normalize.

import pandas as pd
import json
import re

jsonify = lambda x: re.sub(pattern='"Labs":"(.*)"', repl='"Labs":\g<1>', string=x) # Function to remove unnecessary quotes

dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                          '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

json_series = dat['col1'].apply(jsonify) # Remove unnecessary quotes
json_series = json_series.apply(json.loads) # Convert string to dictionary
output = pd.json_normalize(json_series) # "Flatten" the dictionary into columns
output.insert(loc=0, column='a', value=dat['col1']) # Add the original column as a column named "a" because that's what the question requests.

	a	Number	Group	First	Code	Desc	Labs
0	{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}	1	a	Yes	10786	True	['a', 'b', 'c']
1	{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}	2	a	Yes	38971	True	['a', 'b', 'c']

extract specific field by index from json string stored in dataframe

Answers (1)

Related Questions