Reputation: 93
I have a text file which looks like this:
but all unicode chars need to be replaced with corresponding characters and should look like this:
the problem is that I do not want to replace all unicode codes by myself, what is the most efficient way to do this automatically? my code looks like this right now but it needs to be refined for sure!(the code is in Python3)
import io
input = io.open("input.json", "r", encoding="utf-8")
output = io.open("output.txt", "w", encoding="utf-8")
with input, output:
# Read input file.
file = input.read()
file = file.replace("\\u00e4", "ä")
# I think last line is the same as line below:
# file = file .replace("\\u00e4", u"\u00e4")
file = file.replace("\\u00c4", "Ä")
file = file.replace("\\u00f6", "ö")
file = file.replace("\\u00d6", "Ö")
.
.
.
# I cannot put all codes in unicode here manually!
.
.
.
# writing output file
output.write(file)
Upvotes: 0
Views: 6654
Reputation: 1121624
Just decode the JSON as JSON, and then write out a new JSON document without ensuring the data is ASCII safe:
import json
with open("input.json", "r", encoding="utf-8") as input:
with open("output.txt", "w", encoding="utf-8") as output:
document = json.load(input)
json.dump(document, output, ensure_ascii=False)
From the json.dump()
documentation:
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Demo:
>>> import json
>>> print(json.loads(r'"l\u00f6yt\u00e4\u00e4"'))
löytää
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"')))
"l\u00f6yt\u00e4\u00e4"
>>> print(json.dumps(json.loads(r'"l\u00f6yt\u00e4\u00e4"'), ensure_ascii=False))
"löytää"
If you have extremely large documents, you could still process them textually, line by line, but use regular expressions to do the replacements:
import re
unicode_escape = re.compile(
r'(?<!\\)'
r'(?:\\u([dD][89abAB][a-fA-F0-9]{2})\\u([dD][c-fC-F][a-fA-F0-9]{2})'
r'|\\u([a-fA-F0-9]{4}))')
def replace(m):
return bytes.fromhex(''.join(m.groups(''))).decode('utf-16-be')
with open("input.json", "r", encoding="utf-8") as input:
with open("output.txt", "w", encoding="utf-8") as output:
for line in input:
output.write(unicode_escape.sub(replace, line))
This however fails if your JSON has embedded JSON documents in strings or if the escape sequence is preceded by an escaped backslash.
Upvotes: 6