Reputation: 2251
My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh)
and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh).
I see that this site can do it: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx
so I tried the same thing in Python:
mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')
but the result is not what I expect, even though the original string is UTF-8.
Note: these are Vietnamese characters.
How do I resolve this? Is it a Windows encoding or something similar? How can I detect the encoding here?
Upvotes: 16
Views: 36632
Reputation: 68
In Python 3 (tested on 3.9.6), the two directions are:
"string".encode('utf-8').decode('latin-1')  # produces the garbled text
"string".encode('latin1').decode('utf8')  # repairs the garbled text
So, you can use:
'09. BÃ¡t NhÃ£ TÃ¢m Kinh'.encode('latin1').decode('utf8')
and the output is:
>>> '09. BÃ¡t NhÃ£ TÃ¢m Kinh'.encode('latin1').decode('utf8')
'09. Bát Nhã Tâm Kinh'
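As a rough illustration of why the second expression undoes the first, here is a minimal round trip. It assumes the text was garbled by reading UTF-8 bytes as Latin-1; the variable names are only for illustration.

```python
# How the garbled text arises, and why reversing the steps repairs it.
original = '09. Bát Nhã Tâm Kinh'

# UTF-8 bytes misread as Latin-1: each accented letter turns into two characters.
garbled = original.encode('utf-8').decode('latin-1')

# Reverse the mistake: recover the raw bytes via Latin-1, then decode them as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')

print(garbled)
print(repaired == original)
```

Latin-1 maps every byte value to exactly one character, so encoding back to Latin-1 recovers the original bytes without loss.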
Upvotes: 0
Reputation: 1003
I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works (Python 3.x):
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.encode('latin1').decode('utf8')
>>> s
'09. Bát Nhã Tâm Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
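If the input might not always be garbled in the same way, the one-liner can be wrapped defensively. The sketch below is only an assumption about how one might do that: fix_mojibake and its list of candidate encodings are hypothetical names, not part of any library. It tries Latin-1 first and then Windows-1252, and returns the input unchanged if neither round trip yields valid UTF-8.

```python
def fix_mojibake(text, wrong_encodings=('latin-1', 'cp1252')):
    """Try to undo UTF-8 text that was decoded with the wrong single-byte codec.

    Hypothetical helper: the name and the candidate list are assumptions.
    """
    for enc in wrong_encodings:
        try:
            return text.encode(enc).decode('utf-8')
        except UnicodeError:
            # Either the text contains characters outside `enc`, or the
            # resulting bytes are not valid UTF-8; try the next candidate.
            continue
    return text  # nothing worked, so leave the input unchanged


print(fix_mojibake('09. BÃ¡t NhÃ£ TÃ¢m Kinh'))  # garbled input
print(fix_mojibake('09. Bát Nhã Tâm Kinh'))     # already correct input
```

This is a heuristic: a string that legitimately contains a sequence such as 'Ã©' would also be rewritten, so it is best applied only to data known to be damaged.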
Upvotes: 19
Reputation:
Try:
mystr.encode('ascii', 'ignore').decode('utf-8')
This encodes the string as ASCII while ignoring errors, then decodes the result as UTF-8. It drops every non-ASCII character, so the accented letters are removed rather than repaired, but it's one approach.
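To make the trade-off concrete, here is a small sketch of what this approach does to the garbled string from the question (the variable names are illustrative):

```python
garbled = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'

# errors='ignore' silently discards every character that ASCII cannot represent.
stripped = garbled.encode('ascii', 'ignore').decode('utf-8')

print(stripped)  # → 09. Bt Nh Tm Kinh
```

The vowels themselves are lost, not just their accents, so this only suits cases where an approximate ASCII-only result is acceptable.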
Upvotes: 4
Reputation: 11185
The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy
This module fixes pretty much everything and works much better than online decoders.
>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'
It can be easily installed using pip install ftfy
Upvotes: 35