Reputation: 717
This should be a piece of cake, but I'm new to Python and I can't seem to understand how this should be done:
I have a JSON file that I got by retrieving my personal data from Facebook. This is just a chunk of the file:
[
{
"timestamp": 1575826804,
"attachments": [
],
"data": [
{
"post": "This is a test line with character \u00c3\u00ad and \u00c3\u00b3"
},
{
"update_timestamp": 1575826804
}
],
"title": "My Name"
},
{
"timestamp": 1575826526,
"attachments": [
],
"data": [
{
"update_timestamp": 1575826526
}
],
"title": "My Name"
},
{
"timestamp": 1575638718,
"data": [
{
"post": "This is another test line with character \u00c3\u00ad and \u00c3\u00b3 and line breaks\n"
}
],
"title": "My Name escribi\u00c3\u00b3 en la biograf\u00c3\u00ada de Someone."
},
{
"timestamp": 1575561399,
"attachments": [
{
"data": [
{
"external_context": {
"url": "https://youtu.be/lalalalalalaaeeeE"
}
}
]
}
],
"data": [
{
"update_timestamp": 1575561399
}
],
"title": "My Name"
}
]
The file has many Unicode escapes like "\u00c3\u00ad" that I need to convert to ASCII representations. I tried to parse this JSON file and load it as a Python object with the "json" library. First I did:
import json
with open("test.json") as fp:
    data = json.load(fp)
print(type(data))
print(data[0])
# output:
# <class 'list'>
# {'timestamp': 1575826804, 'attachments': [], 'data': [{'post': 'This is a test line with
# character Ã\xad and ó'}, {'update_timestamp': 1575826804}], 'title': 'My Name'}
Although I get a list object from json.load(), the accented characters are wrong: "Ã\xad" and "ó". Then I did:
with open("test.json", encoding='unicode-escape') as fp:
    txt = fp.read().encode('latin1').decode('utf8')
data = json.loads(txt)
print(type(data))
print(data[2])
This second attempt works only if the JSON string doesn't contain characters like newlines "\n" or ":" within a JSON value, but in cases like mine it throws:
JSONDecodeError: Invalid control character at: line 33 column 82 (char 560)
Character 560 is the trailing "\n" inside a JSON value "post":
{
"post": "This is another test line with character \u00c3\u00ad and \u00c3\u00b3 and line breaks\n"
}
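For context, the error can be reproduced in isolation. The sketch below uses a made-up one-line stand-in for the file contents (the `raw` bytes are an assumption for illustration, not taken from the export). Decoding with 'unicode-escape' turns the two-character sequence \n into a real newline, and the JSON parser then rejects that raw control character inside a string:

```python
import json

# Bytes as they would sit on disk: a backslash followed by the letter n.
raw = b'{"post": "line breaks\\n"}'

# 'unicode-escape' interprets \n as an escape, so the text now holds a real newline.
text = raw.decode("unicode-escape")
has_real_newline = "\n" in text

# JSON does not allow unescaped control characters inside string values.
try:
    json.loads(text)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(has_real_newline)
print(error_message)
```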
How should I correctly load this JSON with its Unicode escapes? Is replacing the Unicode escapes with ASCII characters the only way around this?
Thanks in advance for your help!
Upvotes: 1
Views: 763
Reputation: 469
I think you need to use 'raw_unicode_escape'.
import json
with open("j.json", encoding='raw_unicode_escape') as f:
    data = json.loads(f.read().encode('raw_unicode_escape').decode())
print(data[0])
OUT: {'timestamp': 1575826804, 'attachments': [], 'data': [{'post': 'This is a test line with character í and ó'}, {'update_timestamp': 1575826804}], 'title': 'My Name'}
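If decoding the whole file through a codec feels fragile, another option might be to parse the JSON normally and repair each string afterwards. This is only a sketch: `fix_mojibake` is a hypothetical helper name, `sample` is a shortened stand-in for the file, and it assumes every string in the export is UTF-8 whose bytes were written out as Latin-1 code points (a string containing characters above U+00FF would raise UnicodeEncodeError).

```python
import json

def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were stored as Latin-1 code points."""
    if isinstance(value, str):
        # Turn each code point back into a byte, then decode those bytes as UTF-8.
        return value.encode("latin-1").decode("utf-8")
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value

# Shortened stand-in for the file contents (assumption, not the real export).
sample = '[{"post": "character \\u00c3\\u00ad and \\u00c3\\u00b3 and line breaks\\n"}]'

data = fix_mojibake(json.loads(sample))
print(data[0]["post"])
```

Because the repair happens after parsing, escapes such as \n are handled by the json module as usual.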
Does this help?
Upvotes: 1