Reputation: 2480
First, note that the symbol β (Greek beta) has the hex representation CE B2 in UTF-8.
I have legacy source code in Python 2.7 that uses JSON strings:
u'{"something":"text \\u00ce\\u00b2 text..."}'
It then calls json.loads(string) or json.loads(string, 'utf-8'), but the result is a Unicode string in which each UTF-8 byte appears as a separate character:
u'text \xce\xb2 text'
What I want is a normal Python Unicode (UTF-16?) string:
u'text β text'
If I call:
text = text.decode('unicode_escape')
before json.loads, then I get the correct Unicode β symbol, but it also breaks the JSON by replacing every \n escape with a real newline.
The question is: how do I convert only the "\\u00ce\\u00b2" part without affecting other JSON special characters?
(I am new to Python, and it is not my source code, so I have no idea how this is supposed to work. I suspect that the code only works with ASCII characters)
Upvotes: 1
Views: 1339
Reputation: 178409
Here's a string-fixer that works after loading the JSON. It handles UTF-8-like sequences of any length and ignores escape sequences that don't look like UTF-8 sequences.
Example:
import json
import re
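def fix(bad):
    # Match a UTF-8 lead byte followed by continuation bytes, each stored as
    # its own code point, then reinterpret those code points as UTF-8 bytes.
    return re.sub(ur'[\xc2-\xf4][\x80-\xbf]+',
                  lambda m: m.group(0).encode('latin1').decode('utf8'),
                  bad)
# 2- and 3-byte UTF-8-like sequences and one correct escape code.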
json_text = '''\
{
"something":"text \\u00ce\\u00b2 text \\u00e4\\u00bd\\u00a0\\u597d..."
}
'''
data = json.loads(json_text)
bad_str = data[u'something']
good_str = fix(bad_str)
print bad_str
print good_str
Output:
text Î² text ä½ 好...
text β text 你好...
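For readers on Python 3, the same idea can be sketched as below. This is an adaptation, not the code above: the names `fix_mojibake` and `repair` are made up for illustration, and the `try`/`except` is an added assumption so that a run which merely looks like UTF-8 but does not decode is left untouched instead of raising.

```python
import json
import re

# A UTF-8 lead byte (0xC2-0xF4) followed by one or more continuation bytes
# (0x80-0xBF), where each byte has ended up as its own code point.
_MOJIBAKE = re.compile('[\xc2-\xf4][\x80-\xbf]+')

def fix_mojibake(text):
    """Reinterpret UTF-8-like runs of Latin-1 characters as real UTF-8."""
    def repair(match):
        chunk = match.group(0)
        try:
            # latin-1 maps code points 0-255 back to the same byte values.
            return chunk.encode('latin-1').decode('utf-8')
        except UnicodeDecodeError:
            # Not a valid UTF-8 sequence after all: keep the original text.
            return chunk
    return _MOJIBAKE.sub(repair, text)

json_text = '{"something": "text \\u00ce\\u00b2 text \\u00e4\\u00bd\\u00a0\\u597d..."}'
data = json.loads(json_text)
fixed = fix_mojibake(data['something'])
print(fixed)
```

Because the fix runs after `json.loads`, ordinary JSON escapes such as `\n` have already been handled by the parser and are not affected.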
Upvotes: 1
Reputation: 189937
Something like this, perhaps. This is limited to 2-byte UTF-8 characters.
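import json
import re
j = u'{"something":"text \\u00ce\\u00b2 text..."}'
def decodeu(match):
    # Rebuild the two bytes, decode them as UTF-8, and return the
    # six-character \uXXXX escape for the resulting code point.
    u = '%c%c' % (int(match.group(1), 16), int(match.group(2), 16))
    return repr(u.decode('utf-8'))[2:8]
j = re.sub(r'\\u00([cd][0-9a-f])\\u00([89ab][0-9a-f])', decodeu, j)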
print(j)
This prints {"something":"text \u03b2 text..."}
for your sample. At this point, you can load it as regular JSON and get the final string you want.
result = json.loads(j)
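A rough Python 3 sketch of the same pre-parse rewrite follows, still limited to 2-byte sequences. The names `PAIR` and `merge_pair` are hypothetical, and leaving undecodable pairs unchanged is an assumption added here. Like the regex above, it does not distinguish a real `\u` escape from one preceded by an escaped backslash.

```python
import json
import re

# Two consecutive \u00XX escapes that look like a 2-byte UTF-8 sequence:
# a lead byte in C0-DF followed by a continuation byte in 80-BF.
PAIR = re.compile(r'\\u00([cd][0-9a-f])\\u00([89ab][0-9a-f])', re.IGNORECASE)

def merge_pair(match):
    raw = bytes([int(match.group(1), 16), int(match.group(2), 16)])
    try:
        char = raw.decode('utf-8')
    except UnicodeDecodeError:
        # For example the overlong lead bytes C0 and C1: leave the text as is.
        return match.group(0)
    # Emit a single JSON escape for the decoded code point.
    return '\\u%04x' % ord(char)

j = '{"something":"text \\u00ce\\u00b2 text...\\n"}'
fixed_json = PAIR.sub(merge_pair, j)
result = json.loads(fixed_json)
print(fixed_json)
print(repr(result['something']))
```

Only the matched pairs are rewritten, so the `\n` escape in the sample reaches `json.loads` unchanged and is decoded by the parser as usual.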
Upvotes: 1