an10b3
an10b3

Reputation: 273

extract specific field by index from json string stored in dataframe

I have a dataframe that contains json style dictionaries per row, and I need to extract the fourth and eigth fields, or values from the second and fourth pairs to a new column i.e. 'a' for row 1, 'a' for 2 (corresponding to the Group) and '10786' for row 1, '38971' for row 2 (corresponding to Code). The expected output is below.

dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                          '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

expected output

                                                   a Group   Code
0  {"Number":1,"Group":"a","First":"Yes","Second"...     a  10786
1  {"Number":2,"Group":"a","First":"Yes","Second"...     a  38971

I have tried indexing by location but its printing only characters rather than fields e.g.

tuple(dat['a'].items())[0][1][4]

I also cannot appear to normalize the data with json_normalize, which I'm not sure why - perhaps the json string is stored suboptimally. So I am quite confused, and any tips would be great. Thanks so much

Upvotes: 0

Views: 352

Answers (1)

Luke Tonin
Luke Tonin

Reputation: 51

The reason pd.json_normalize is not working is because pd.json_normalize works on dictionaries and your strings are not json compatible.

Your strings are not json compatible because the "Labs" values contain multiple quotes which aren't escaped. It's possible to write a quick function to make your strings json compatible, then parse them as dictionaries, and finally use pd.json_normalize.

import pandas as pd
import json
import re

jsonify = lambda x: re.sub(pattern='"Labs":"(.*)"', repl='"Labs":\g<1>', string=x) # Function to remove unnecessary quotes

dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                          '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

json_series = dat['col1'].apply(jsonify) # Remove unnecessary quotes
json_series = json_series.apply(json.loads) # Convert string to dictionary
output = pd.json_normalize(json_series) # "Flatten" the dictionary into columns
output.insert(loc=0, column='a', value=dat['col1']) # Add the original column as a column named "a" because that's what the question requests.
a Number Group First Code Desc Labs
0 {"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"} 1 a Yes 10786 True ['a', 'b', 'c']
1 {"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"} 2 a Yes 38971 True ['a', 'b', 'c']

Upvotes: 1

Related Questions