Phelipe
Phelipe

Reputation: 309

String to Json Handling

I have a database with one nested column. I need to split each value inside this nested cells. So my idea is to take this nested string and transform in json with json.loads then use json_normalize to create a dataframe with this nested values but my data has some values that are causing me some problems, like this:

df['column'][1]:
"[{'id': 124, 'status': 'incomplete', 'formation': 'asddas', 'institution': 'asdasd', 'course': 'asdsads', 'startMonth': 7, 'startYear': 2018, 'endMonth': xx, 'endYear': None}, {'id': 7654414, 'status': 'complete', 'formation': 'technological', 'institution': 'asdasdada', 'course': 'asdasdasdsa', 'startMonth': 1, 'startYear': 2016, 'endMonth': 6, 'endYear': 2018}]"

df['column'][2]:
'[{"institution":"asdadsad","startMonth":2,"startYear":1992,"course":"prop e.mkt e adm empresas","id":11111,"formation":"asdasd","endMonth":12,"endYear":1996,"status":"complete"},{"institution":"Universidade Sant\'Anna","startMonth":2,"startYear":1998,"course":"asdasdasd","id":2007186,"formation":"asdasda","endMonth":12,"endYear":2002,"status":"complete"}]'

As you can see, the first cell has single quotes, so I used df['column'].str.replace("'", '"') to put double quotes then I used json.loads to transform to json and it worked.

But after the transformation the single quote in the second line

({"institution":"Universidade Sant\'Anna") 

became

(Sant"Anna) 

so the json.loads raises a error.

df['column'][2]: (After transformation)
'[{"institution":"asdadsad","startMonth":2,"startYear":1992,"course":"prop e.mkt e adm empresas","id":11111,"formation":"asdasd","endMonth":12,"endYear":1996,"status":"complete"},{"institution":"Universidade Sant"Anna","startMonth":2,"startYear":1998,"course":"asdasdasd","id":2007186,"formation":"asdasda","endMonth":12,"endYear":2002,"status":"complete"}]'

json.loads(df['column'])

Error

JSONDecodeError: Expecting ',' delimiter: line 1 column 247 (char 246)

I tried a lot of transformations to solve the problem but I wasn't able to find a solution that solve the two situations in a row.

Upvotes: 0

Views: 60

Answers (1)

Jonathan Leon
Jonathan Leon

Reputation: 5648

Because you don't have to do any transformations to rows with double quotes, apply a lambda function to test the beginning of the string.

If I set your first example to s, I can test s.startswith("[{'") to look for a single quote. This returns True

something like this (untested but should be close). It'll return a transformed string only if it starts with a single quote.

df['column'].apply(lambda x: x.replace("'", '"') if x.startswith("[{'") else x)

Upvotes: 1

Related Questions