Reputation: 273
I have a dataframe that contains json style dictionaries per row, and I need to extract the fourth and eigth fields, or values from the second and fourth pairs to a new column i.e. 'a' for row 1, 'a' for 2 (corresponding to the Group) and '10786' for row 1, '38971' for row 2 (corresponding to Code). The expected output is below.
dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
'{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})
expected output
a Group Code
0 {"Number":1,"Group":"a","First":"Yes","Second"... a 10786
1 {"Number":2,"Group":"a","First":"Yes","Second"... a 38971
I have tried indexing by location but its printing only characters rather than fields e.g.
tuple(dat['a'].items())[0][1][4]
I also cannot appear to normalize the data with json_normalize, which I'm not sure why - perhaps the json string is stored suboptimally. So I am quite confused, and any tips would be great. Thanks so much
Upvotes: 0
Views: 352
Reputation: 51
The reason pd.json_normalize
is not working is because pd.json_normalize
works on dictionaries and your strings are not json compatible.
Your strings are not json compatible because the "Labs" values contain multiple quotes which aren't escaped.
It's possible to write a quick function to make your strings json compatible, then parse them as dictionaries, and finally use pd.json_normalize
.
import pandas as pd
import json
import re
jsonify = lambda x: re.sub(pattern='"Labs":"(.*)"', repl='"Labs":\g<1>', string=x) # Function to remove unnecessary quotes
dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
'{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})
json_series = dat['col1'].apply(jsonify) # Remove unnecessary quotes
json_series = json_series.apply(json.loads) # Convert string to dictionary
output = pd.json_normalize(json_series) # "Flatten" the dictionary into columns
output.insert(loc=0, column='a', value=dat['col1']) # Add the original column as a column named "a" because that's what the question requests.
a | Number | Group | First | Code | Desc | Labs | |
---|---|---|---|---|---|---|---|
0 | {"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"} | 1 | a | Yes | 10786 | True | ['a', 'b', 'c'] |
1 | {"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"} | 2 | a | Yes | 38971 | True | ['a', 'b', 'c'] |
Upvotes: 1