花小田

Reputation: 21

Python 3: Unicode escapes to Chinese

I have an encoding problem. I have a JSON file and need to convert the "content" field to Traditional Chinese; the text may also contain emoji and similar characters. I would like to do this with Python 3. An example of the JSON file follows:

"messages": [
    {
      "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
      "timestamp_ms": 1610288228221,
      "content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2",
      "type": "Generic",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
      "timestamp_ms": 1610288227699,
      "share": {
        "link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
        "share_text": "//\nMemorabilia\u00f0\u009f\u0087\u00b0\u00f0\u009f\u0087\u00b7\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a9\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a7\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a8\u00e2\u0080\u008d\u00f0\u009f\u0091\u00a6\n\u00f0\u009f\u0098\u0086\u00f0\u009f\u00a4\u00a3\u00f0\u009f\u00a4\u00ac\u00f0\u009f\u0098\u008c\u00f0\u009f\u0098\u00b4\u00f0\u009f\u00a4\u00a9\u00f0\u009f\u00a4\u0093\n#191214\n#191221",
        "original_content_owner": "_ki.zeng"
      },
      "type": "Share",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL\u00e2\u0080\u0099s \u00f0\u009f\u008e\u0088\u00f0\u009f\u0092\u009b",
      "timestamp_ms": 1607742844729,
      "content": "\u00e6\u0089\u00ae\u00e7\u009e\u0093\u00e5\u00b0\u00b1\u00e5\u00a5\u00bd",
      "type": "Generic",
      "is_unsent": false
    }]

Upvotes: 0

Views: 441

Answers (1)

Mark Tolonen

Reputation: 178115

The data posted isn't valid JSON (at a minimum it is missing the outer curly braces), and it was encoded incorrectly: the UTF-8 bytes were written out as individual Unicode code points. Ideally, correct the code that produced the file. Failing that, the following will fix the data you have now, assuming "input.json" is the original data with the outer curly braces added:

import json

# Read the raw bytes of the data file
with open('input.json','rb') as f:
    raw = f.read()

# There are some newline escapes that shouldn't be converted,
# so double-escape them so the result leaves them escaped.
raw = raw.replace(rb'\n',rb'\\n')

# Convert all the escape codes to Unicode characters
raw = raw.decode('unicode_escape')

# The characters are really UTF-8 byte values.
# The "latin1" codec translates Unicode code points 1:1 to byte values,
# resulting in a byte string again.
raw = raw.encode('latin1')

# Decode correctly as UTF-8
raw = raw.decode('utf8')

# Now that the JSON is fixed, load it into a Python object
data = json.loads(raw)

# Re-write the JSON correctly.
with open('output.json','w',encoding='utf8') as f:
    json.dump(data,f,ensure_ascii=False,indent=2)

Result:

{
  "messages": [
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1610288228221,
      "content": "我隔離",
      "type": "Generic",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1610288227699,
      "share": {
        "link": "https://www.instagram.com/p/B6UlYZvA4Pd/",
        "share_text": "//\nMemorabilia🇰🇷👩‍👩‍👧‍👧👨‍👨‍👦\n😆🤣🤬😌😴🤩🤓\n#191214\n#191221",
        "original_content_owner": "_ki.zeng"
      },
      "type": "Share",
      "is_unsent": false
    },
    {
      "sender_name": "#20KAREL’s 🎈💛",
      "timestamp_ms": 1607742844729,
      "content": "扮瞓就好",
      "type": "Generic",
      "is_unsent": false
    }
  ]
}
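
As an alternative sketch (not part of the script above), the same latin1-to-UTF-8 round trip can be applied after parsing, one string at a time, so escapes such as \n or \" inside the JSON never need special handling. The names fix_mojibake and sample are hypothetical and chosen only for illustration. The sketch assumes every mis-encoded string holds UTF-8 bytes stored as code points below U+0100, and it leaves a string unchanged when the round trip fails:

```python
import json

def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were stored as code points.

    Hypothetical helper: assumes the damage is exactly "UTF-8 bytes written
    as Unicode code points", as described in the answer above.
    """
    if isinstance(value, str):
        try:
            # latin1 maps code points 0-255 1:1 to bytes; then decode as UTF-8.
            return value.encode('latin1').decode('utf8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Round trip failed, so treat the string as already correct.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(key): fix_mojibake(item) for key, item in value.items()}
    # Numbers, booleans and None pass through untouched.
    return value

# One message taken from the question, kept as raw JSON text.
sample = r'{"content": "\u00e6\u0088\u0091\u00e9\u009a\u0094\u00e9\u009b\u00a2", "timestamp_ms": 1610288228221}'

fixed = fix_mojibake(json.loads(sample))
print(json.dumps(fixed, ensure_ascii=False))
```

For a whole file, the same helper could be applied to the object returned by json.load and the result written back with ensure_ascii=False, as in the script above. One caveat: a string that is already correct but happens to consist only of characters below U+0100 forming valid UTF-8 would also be converted, so this is best run only on data known to be mis-encoded.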

Upvotes: 1
