Reputation: 43
I'm new to Python and I have encountered an issue regarding Unicode text content and JSON fields.
My goal is to read some text files that contain Unicode characters and put their entire content into JSON fields. However, the JSON fields end up containing escape sequences instead of the original Unicode characters (e.g. the JSON has \u00e8\u0107 instead of èć). How can I write the whole text file content into the JSON fields as-is?
Here is my code:
import json
file_1 = open('utf8_1.txt', 'r', encoding='utf-8').read()
file_2 = open('utf8_2.txt', 'r', encoding='utf-8').read()
with open("test.json", "r") as jsonFile:
    data = json.load(jsonFile)
    data[0]['field_1'] = file_1
    data[0]['field_2'] = file_2
with open("test.json", "w") as jsonFile:
    json.dump(data, jsonFile)
Here are two files that have Unicode characters:
utf8_1.txt:
Kèććia
ivò
utf8_2.txt:
ććiùri
iχa
Here is the test.json: (note: Two fields are set to be empty and need to be updated with the file content)
[
    {
        "field_1": "",
        "field_2": ""
    }
]
and here is what I got on test.json from running the code above:
[
    {
        "field_1": "K\u00e8\u0107\u0107ia\niv\u00f2",
        "field_2": "\u0107\u0107i\u00f9ri\ni\u03c7a"
    }
]
But my expected output for test.json is something like the following:
[
    {
        "field_1": "Kèććia ivò",
        "field_2": "ććiùri iχa"
    }
]
My goal is to put whatever is in utf8_1.txt into field_1 and whatever is in utf8_2.txt into field_2 in test.json, preferably as a string value. I have been stuck on this for a long time. I really appreciate your help!
Upvotes: 1
Views: 2441
Reputation: 177471
What you get is valid UTF-8 JSON. It's just written as pure ASCII using escape codes for non-ASCII characters, which as a subset of UTF-8 is also valid UTF-8. Read it back in with json.load and it will be the original string. If you want the actual Unicode characters encoded as UTF-8 instead of escape codes when written to the file, use json.dump with the ensure_ascii=False parameter, and make sure to open the file with encoding='utf8':
with open("test.json", "w", encoding='utf8') as jsonFile:
    json.dump(data, jsonFile, ensure_ascii=False)
This is in the documentation:
json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
...
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
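To see the difference without touching any files, here is a minimal sketch using json.dumps and json.loads (the in-memory counterparts of json.dump and json.load). The sample string is the content of utf8_1.txt from the question; the variable names are only for illustration.

```python
import json

# Sample content from utf8_1.txt in the question (two lines of text).
text = "Kèććia\nivò"

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({"field_1": text})

# ensure_ascii=False: non-ASCII characters are written as-is.
literal = json.dumps({"field_1": text}, ensure_ascii=False)

# The line break is still written as the two characters \n in both cases,
# because JSON strings may not contain raw control characters.
assert "\\u00e8" in escaped and "è" not in escaped
assert "è" in literal and "\\n" in literal

# Both forms decode back to the same Python string.
assert json.loads(escaped) == json.loads(literal) == {"field_1": text}
```

Note that the newline from the text file still shows up as \n inside the JSON string even with ensure_ascii=False. If a space is wanted there instead, as in the expected output in the question, the text would have to be changed before dumping, for example with str.replace.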
Upvotes: 4