Replace all unicode codes with characters in python

Question

I have a text file which looks like this:

l\u00f6yt\u00e4\u00e4

but all unicode chars need to be replaced with corresponding characters and should look like this:

löytää

the problem is that I do not want to replace all unicode codes by myself, what is the most efficient way to do this automatically? my code looks like this right now but it needs to be refined for sure!(the code is in Python3)

import io
input = io.open("input.json", "r", encoding="utf-8")
output = io.open("output.txt", "w", encoding="utf-8")
with input, output:
    # Read input file.
    file = input.read()
    file = file.replace("\u00e4", "ä")
    # I think last line is the same as line below:
    # file = file .replace("\u00e4", u"\u00e4")
    file = file.replace("\u00c4", "Ä")
    file = file.replace("\u00f6", "ö")
    file = file.replace("\u00d6", "Ö")
    .
    .
    . 
    # I cannot put all codes in unicode here manually!
    .
    .
    .
    # writing output file
    output.write(file)

Martijn Pieters · Accepted Answer

Just decode the JSON as JSON, and then write out a new JSON document without ensuring the data is ASCII safe:

import json

with open("input.json", "r", encoding="utf-8") as input:
    with open("output.txt", "w", encoding="utf-8") as output:
        document = json.load(input)
        json.dump(document, output, ensure_ascii=False)

From the json.dump() documentation:

If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.

Demo:

>>> import json
>>> print(json.loads(r'"l\u00f6yt\u00e4\u00e4"'))
löytää
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"')))
"l\u00f6yt\u00e4\u00e4"
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"'), ensure_ascii=False))
"löytää"

If you have extremely large documents, you could still process them textually, line by line, but use regular expressions to do the replacements:

import re

unicode_escape = re.compile(
    r'(?



This however fails if your JSON has embedded JSON documents in strings or if the escape sequence is preceded by an escaped backslash.

Replace all unicode codes with characters in python

Answers (1)

Related Questions