Reputation: 44305
I have the text Aur\xc3\xa9lien and want to decode it with Python 3.8.
I tried the following:
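import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")                  # TypeError: s is a str, not bytes
codecs.decode(bytes(s), "utf-8")           # TypeError: bytes() needs an encoding for a str
codecs.decode(bytes(s, "utf-8"), "utf-8")  # round-trips back to s, i.e. 'AurÃ©lien'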
but none of them gives the correct result, Aurélien.
How do I do it correctly?
And is there a basic, authoritative page that describes all these encodings for Python?
Upvotes: 0
Views: 9945
Reputation: 649
First find the encoding of the data, then decode it. To do this you need a byte string, which you get by prefixing the original string literal with the letter b.
Try this:
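import chardet  # third-party package: pip install chardet
s = "Aur\xc3\xa9lien"
bs = b"Aur\xc3\xa9lien"
# Guess the encoding from the raw bytes; the guess is heuristic and may be wrong for short input.
encoding = chardet.detect(bs)["encoding"]
# This step only recovers the bytes if the guess is a single-byte encoding such as ISO-8859-1.
decoded = s.encode(encoding).decode("utf-8")  # named decoded so the built-in str is not shadowed
print(decoded)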
If you are reading the text from a file, you can detect the encoding using the magic lib, see here: https://stackoverflow.com/a/16203777/1544937
Upvotes: 3
Reputation: 9523
Your data is a sequence of bytes, not a sequence of Unicode characters, so you should prefix the literal with b:
import codecs
b = b"Aur\xc3\xa9lien"
b.decode('utf-8')
This gives the expected 'Aurélien'.
If you want to start from s, encode it with latin-1 first. That codec maps the code points U+0000 to U+00FF one-to-one onto the byte values 0x00 to 0xFF, so it recovers the original bytes from your string exactly. Other 8-bit codecs such as mbcs or mac_roman work only when they happen to map these particular characters to the same byte values, so latin-1 is the safe choice. You then have a bytes object, and you can decode it as UTF-8 as in the first part of this answer.
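The one-to-one mapping mentioned above can be checked directly for latin-1. The following is only a small sketch; the variable names are made up for illustration.

```python
# latin-1 maps byte values 0..255 onto code points 0..255, one-to-one.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# Every character's code point equals the byte value it came from.
one_to_one = all(ord(ch) == byte for ch, byte in zip(as_text, all_bytes))

s = "Aur\xc3\xa9lien"
raw = s.encode("latin-1")    # recovers the original bytes unchanged
fixed = raw.decode("utf-8")  # interprets those bytes as UTF-8

print(one_to_one)
print(raw)
print(fixed)
```

Because the mapping is lossless, encoding with latin-1 never alters the byte values, which is what makes the second step (decoding as UTF-8) reliable.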
Upvotes: 0
Reputation: 3121
You have UTF-8 decoded as latin-1, so the solution is to encode as latin-1 then decode as UTF-8.
s = "Aur\xc3\xa9lien"
s.encode('latin-1').decode('utf-8')
print(s.encode('latin-1').decode('utf-8'))
Output
Aurélien
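If the same repair has to be applied to many strings, some of which may already be correct, the two calls can be wrapped in a small function. This is a sketch only: fix_mojibake is a hypothetical helper name, and it assumes that any string which cannot make the latin-1 to UTF-8 round trip was fine to begin with.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as latin-1 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin-1, or the recovered
        # bytes are not valid UTF-8. Assume the text was already correct.
        return text


print(fix_mojibake("Aur\xc3\xa9lien"))  # damaged input gets repaired
print(fix_mojibake("Aurélien"))         # correct input is returned unchanged
```

Note that a string which legitimately contains a sequence such as Ã© would also be rewritten, so this is best applied only to data known to come from the faulty source.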
Upvotes: 1