pickle san

Reputation: 43

How to get utf-8 text file into a Json object in Python 3

I'm new to Python and I have encountered an issue regarding Unicode text content and JSON fields.

My goal is to read some text files that contain Unicode characters, extract their whole content, and put it into JSON fields. However, the JSON fields end up containing ASCII escape sequences (e.g. \u00e8\u0107) instead of the original Unicode characters (e.g. èć). How can I write the text file content into the JSON fields as-is?

Here is my code:

import json

with open('utf8_1.txt', 'r', encoding='utf-8') as f:
    file_1 = f.read()
with open('utf8_2.txt', 'r', encoding='utf-8') as f:
    file_2 = f.read()

with open("test.json", "r") as jsonFile:
    data = json.load(jsonFile)

data[0]['field_1'] = file_1
data[0]['field_2'] = file_2

with open("test.json", "w") as jsonFile:
    json.dump(data, jsonFile)
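The behavior can be reproduced with a one-liner (using only the standard json module; the string is just a sample):

```python
import json

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which is why the file ends up with \u00e8 instead of è.
print(json.dumps("è"))  # prints "\u00e8"
```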

Here are two files that have Unicode characters:

utf8_1.txt:

Kèććia
ivò

utf8_2.txt:

ććiùri
iχa

Here is the test.json: (note: Two fields are set to be empty and need to be updated with the file content)

[
  {
    "field_1": "",
    "field_2": ""
  }
]

and here is what I got on test.json from running the code above:

[
  {
    "field_1": "K\u00e8\u0107\u0107ia\niv\u00f2",
    "field_2": "\u0107\u0107i\u00f9ri\ni\u03c7a"
  }
]

But my expected output for test.json is something like the following:

[
  {
    "field_1": "Kèććia ivò",
    "field_2": "ććiùri iχa"
  }
]

My goal is to put whatever is in utf8_1.txt into field_1 and whatever is in utf8_2.txt into field_2 in test.json, preferably as a plain string value. I have been stuck on this for a long time. I really appreciate your help!

Upvotes: 1

Views: 2441

Answers (1)

Mark Tolonen

Reputation: 177471

What you get is valid JSON. It's just written as pure ASCII, using escape codes for the non-ASCII characters; since ASCII is a subset of UTF-8, it is also valid UTF-8. Read it back in with json.load and you will get the original string. If you want the actual Unicode characters written to the file as UTF-8 instead of escape codes, pass ensure_ascii=False to json.dump, and make sure to open the file with encoding='utf8':

with open("test.json", "w", encoding='utf8') as jsonFile:
    json.dump(data, jsonFile, ensure_ascii=False)
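A quick sketch of both points, using sample data shaped like the question's (standard json module only): the escaped form and the literal form are different bytes on disk, but json.loads recovers the identical Python string from either.

```python
import json

data = [{"field_1": "Kèććia\nivò", "field_2": "ććiùri\niχa"}]

# Default: non-ASCII characters are escaped into \uXXXX sequences.
escaped = json.dumps(data)
# With ensure_ascii=False: characters are written as-is.
literal = json.dumps(data, ensure_ascii=False)

print(escaped)  # contains \u00e8, \u0107, ...
print(literal)  # contains è, ć, ...

# Either serialization round-trips to the same original data.
assert json.loads(escaped) == data
assert json.loads(literal) == data
```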

This is in the documentation:

json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
...
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.

Upvotes: 4
