Alex

Reputation: 44305

How to decode a text in python3?

I have the text Aur\xc3\xa9lien and want to decode it with Python 3.8.

I tried the following

import codecs
s = "Aur\xc3\xa9lien"
codecs.decode(s, "utf-8")
codecs.decode(bytes(s), "utf-8")
codecs.decode(bytes(s, "utf-8"), "utf-8")

but none of them gives the correct result Aurélien.

How to do it correctly?

Also, is there a single authoritative page that describes how all these encodings work in Python?

Upvotes: 0

Views: 9945

Answers (3)

jgphilpott

Reputation: 649

First detect the encoding of the data, then decode it. Detection works on bytes, so you need a byte string, which you get by adding the letter 'b' in front of the string literal.

Try this:

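import chardet  # third-party package: pip install chardet

s = "Aur\xc3\xa9lien"    # str: each \x escape is a code point, not a byte
bs = b"Aur\xc3\xa9lien"  # bytes: the same escapes as raw byte values

# detect() returns a guess; it can be None or wrong for short inputs
encoding = chardet.detect(bs)["encoding"]

# decode the bytes, not the str, and avoid shadowing the built-in str
decoded = bs.decode(encoding)

print(decoded)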

If you are reading the text from a file, you can detect the encoding using the magic library; see here: https://stackoverflow.com/a/16203777/1544937
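When the file's encoding is already known, detection can be skipped. The following is a minimal sketch using only the standard library, assuming the file really is UTF-8; the temporary file merely stands in for your own file.

```python
import os
import tempfile

# Create a sample file holding the UTF-8 bytes (stand-in for a real file).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"Aur\xc3\xa9lien")
    path = f.name

# Read raw bytes, then decode them with the known encoding.
with open(path, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")

os.remove(path)  # clean up the sample file
print(text)
```

Opening the file with open(path, encoding="utf-8") in text mode does the same decoding in one step.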

Upvotes: 3

Giacomo Catenazzi

Reputation: 9523

Your string literal holds raw UTF-8 bytes rather than Unicode text, so you should prefix it with b to make it a bytes object:

b = b"Aur\xc3\xa9lien"  # bytes literal; no import is needed
b.decode('utf-8')

This gives the expected result: 'Aurélien'.
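To see why the third attempt in the question changes nothing, here is a small sketch comparing the str and the bytes literal. The variable names round_trip and decoded are only illustrative.

```python
s = "Aur\xc3\xa9lien"    # str: 9 code points, including U+00C3 and U+00A9
b = b"Aur\xc3\xa9lien"   # bytes: 9 bytes; 0xC3 0xA9 is the UTF-8 form of "é"

# Encoding s as UTF-8 and decoding again is a round trip back to s.
round_trip = bytes(s, "utf-8").decode("utf-8")
print(round_trip == s)

# Decoding the bytes literal merges 0xC3 0xA9 into a single character.
decoded = b.decode("utf-8")
print(len(bytes(s, "utf-8")), len(b), len(decoded))
```

Encoding s as UTF-8 turns each of the two non-ASCII code points into two bytes, so decoding simply restores the same mis-decoded string.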

If you want to use s, encode it with latin-1. That codec maps every code point from U+0000 to U+00FF to the byte with the same value (a 1 to 1 mapping), so it recovers the original bytes exactly. Other 8-bit codecs such as mbcs (Windows only) or mac_roman may work for some strings, but they do not guarantee this mapping. Once you have the bytes object, you can decode it as in the first part of this answer.
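The one-to-one mapping can be checked directly. This is a minimal sketch using only the standard library; the names all_low, raw and fixed are illustrative.

```python
# latin-1 maps code points U+0000..U+00FF to identical byte values.
all_low = "".join(chr(i) for i in range(256))
print(all_low.encode("latin-1") == bytes(range(256)))

s = "Aur\xc3\xa9lien"        # str holding the mis-decoded text
raw = s.encode("latin-1")    # recover the original bytes
fixed = raw.decode("utf-8")  # decode them as UTF-8
print(raw)
print(fixed)

# mac_roman places these characters at other byte values,
# so it does not give back the same bytes.
print(s.encode("mac_roman") == raw)
```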

Upvotes: 0

mhhabib

Reputation: 3121

Your string contains UTF-8 bytes that were decoded as latin-1, so the solution is to encode it as latin-1 and then decode it as UTF-8.

s = "Aur\xc3\xa9lien"
s.encode('latin-1').decode('utf-8')
print(s.encode('latin-1').decode('utf-8'))

Output
Aurélien
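If the input may or may not be mis-decoded, the same round trip can be wrapped with a fallback. This is only a sketch: fix_mojibake is a hypothetical helper name, and it assumes that text which fails the round trip was already correct.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were decoded as latin-1 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin-1, or not valid UTF-8: leave unchanged.
        return text


print(fix_mojibake("Aur\xc3\xa9lien"))  # mis-decoded input gets repaired
print(fix_mojibake("Aurélien"))         # already correct input is returned as-is
```

Plain ASCII text passes through unchanged either way, since ASCII is identical in latin-1 and UTF-8.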

Upvotes: 1
