Reputation: 33
I have a JSON file from Facebook's "Download your data" feature. Instead of escaping each Unicode character as its code point number, the file escapes it as a sequence of UTF-8 bytes.
For example, the letter á (U+00E1) is escaped in the JSON file as \u00c3\u00a1 instead of \u00e1. 0xC3 0xA1 is the UTF-8 encoding of U+00E1.
The json library in Python 3 decodes it as Ã¡, which corresponds to U+00C3 and U+00A1.
Is there a way to parse such a file correctly (so that I get the letter á) in Python?
Upvotes: 3
Views: 2190
Reputation: 1001
It seems they encoded their Unicode string into bytes using UTF-8 and then wrote each byte into the JSON as if it were a code point. This is very bad behaviour on their part.
Python 3 example:
>>> '\u00c3\u00a1'.encode('latin1').decode('utf-8')
'á'
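To see why this works, here is the same fix split into steps. This is only a sketch; the variable names are mine, and the raw string stands in for what the file actually contains.

```python
import json

# What the file contains: each UTF-8 byte of "á" escaped as if it were a code point.
raw = r'"\u00c3\u00a1"'

decoded = json.loads(raw)            # two characters: U+00C3 and U+00A1
as_bytes = decoded.encode('latin1')  # Latin-1 maps U+0000..U+00FF to the same byte values
fixed = as_bytes.decode('utf-8')     # reinterpret those bytes as UTF-8

print([hex(ord(c)) for c in decoded])  # ['0xc3', '0xa1']
print(as_bytes)                        # b'\xc3\xa1'
print(fixed)                           # á
```

Latin-1 is used for the middle step because it is the one codec where code point N becomes byte N, so the original byte sequence comes back unchanged.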
You need to parse the JSON and walk the entire data structure to fix it:
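import json

def visit_list(l):
    return [visit(item) for item in l]

def visit_dict(d):
    # Keys are strings too, so fix them as well as the values.
    return {visit(k): visit(v) for k, v in d.items()}

def visit_str(s):
    # Recover the original bytes, then decode them as UTF-8.
    return s.encode('latin1').decode('utf-8')

def visit(node):
    # Dispatch on the exact type; numbers, booleans and None pass through.
    funcs = {
        list: visit_list,
        dict: visit_dict,
        str: visit_str,
    }
    func = funcs.get(type(node))
    if func:
        return func(node)
    else:
        return node

# Raw string so the \u escapes reach the JSON parser, as they would from the file.
incorrect = r'{"foo": ["\u00c3\u00a1", 123, true]}'
correct_obj = visit(json.loads(incorrect))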
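
Note that the walk above assumes every string in the file is mis-encoded this way. If some strings might already be correct, or contain characters above U+00FF, encode('latin1') or decode('utf-8') will raise. A more defensive sketch, assuming you would rather leave such strings untouched than fail (fix_text and fix_node are names I made up for illustration):

```python
import json

def fix_text(s):
    """Undo the byte-as-code-point escaping; return s unchanged if it does not apply."""
    try:
        return s.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # assume the string was not mis-encoded and keep it as it is.
        return s

def fix_node(node):
    # Recurse through lists and dicts, fixing every string (including dict keys).
    if isinstance(node, str):
        return fix_text(node)
    if isinstance(node, list):
        return [fix_node(item) for item in node]
    if isinstance(node, dict):
        return {fix_node(k): fix_node(v) for k, v in node.items()}
    return node

sample = r'{"foo": ["\u00c3\u00a1", 123, true, "\u20ac"]}'
result = fix_node(json.loads(sample))
print(result)
```

This is a heuristic, not a guarantee: a correctly encoded string whose characters happen to form valid UTF-8 when viewed as Latin-1 bytes would still be rewritten.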
Upvotes: 3