RasmusP_963

Reputation: 320

Converting latin-1 encoded UTF-8 string in Python

I'm using the email library, which I know from Python 2.x, to iterate over some .eml files, but I have Python 3.x installed.

I extract the filename from the header of each payload (attachment) using .get_filename(). The encoding is not set in the header, so I believe Python 3.x interprets the returned string as UTF-8. However, when the filename contains special characters such as "ø", the string looks like this:

=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?=

I have tried numerous ways to convert this string to UTF-8 and failed: with and without first turning it into bytes, and decoding and encoding with latin-1, ISO-8859-1 (which should be the same thing) and utf-8.

I've also tried using:

ast.literal_eval(r"b'=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='")

and decoding that, but it still returns the original string containing the encoded characters.

How does one go about this?

Upvotes: 0

Views: 679

Answers (1)

Giacomo Catenazzi

Reputation: 9523

You are handling email, so you can use the email handling functions.

Try email.header: https://docs.python.org/3.5/library/email.header.html. It is a very small module, and its last example (the second one) shows:

>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?q?p=F6stal?=')
[(b'p\xf6stal', 'iso-8859-1')]

The same module is also available in Python 2.7.

So for your case:

import email.header

subj = '=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='
# decode_header returns a list of (bytes, charset) pairs; take the first one
subject, encoder = email.header.decode_header(subj)[0]
print(subject.decode(encoder))
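
The snippet above assumes the header consists of a single encoded part. A filename may also be plain ASCII (in which case decode_header reports the charset as None) or split into several encoded parts, so a slightly more defensive version might look like the sketch below. The function name decode_mime_header is only illustrative and not part of the email package; make_header and decode_header are the standard email.header functions.

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Return a str for a header value that may contain RFC 2047 encoded words.

    decode_mime_header is a hypothetical helper name used for illustration.
    """
    # decode_header splits the value into (data, charset) pairs;
    # make_header joins them again, decoding each part with its own charset.
    return str(make_header(decode_header(value)))


print(decode_mime_header('=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='))
print(decode_mime_header('report.pdf'))
```

With a helper like this, the result of .get_filename() could be passed straight in, regardless of whether the filename was encoded.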

Upvotes: 2
