Peter Petocz

Reputation: 196

Python: What is this encoding and how to decode?

I have a lot of strings from mail bodies that print like this:

=C3=A9

This should be 'é' for example.

What exactly is this encoding and how to decode it?

I'm using Python 3.5.

EDIT:

I managed to get the body of the mail properly decoded by applying:

quopri.decodestring(sometext).decode('utf-8') 

However, I still struggle to get the From, To, Subject, etc. headers decoded correctly.

This is how I construct the e-mails:

import imaplib
import email
import quopri


mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('[email protected]', '*******')
mail.list()

mail.select('"[Gmail]/All Mail"') 



typ, data = mail.search(None, 'SUBJECT', '"{}"'.format('123456'))

data[0].split()

print(data[0].split())

for e_mail in data[0].split():
    typ, data = mail.fetch('{}'.format(e_mail.decode()),'(RFC822)')
    raw_mail = data[0][1]
    email_message = email.message_from_bytes(raw_mail)
    if email_message.is_multipart():
        for part in email_message.walk():
            if part.get_content_type() == 'text/plain':
                body = part.get_payload()
                to = email_message['To']

                utf = quopri.decodestring(to)

                text = utf.decode('utf-8')
                print(text)
.
.
.

I still got this: =?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=

Upvotes: 3

Views: 2413

Answers (2)

Peter Petocz

Reputation: 196

This solved it:

from email.header import decode_header

def mail_header_decoder(self, header):
    # Decode an encoded header such as '=?UTF-8?B?...?=' into a plain str.
    if header is None:
        return None
    header_new = []
    for part, charset in decode_header(header):
        if isinstance(part, bytes):
            # Encoded headers come back as bytes; use the declared charset,
            # falling back to UTF-8 when none is given.
            header_new.append(part.decode(charset or 'utf-8', errors='replace'))
        else:
            # A header without encoded words is returned unchanged as str.
            header_new.append(part)
    return ''.join(header_new)  # convert list to string
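
A shorter route to the same result, offered as a sketch rather than as part of the original answer: the standard library's make_header can reassemble the pairs returned by decode_header, using each part's own charset. The helper name decode_mail_header is hypothetical, and the sample value is the header shown in the question.

```python
from email.header import decode_header, make_header

def decode_mail_header(header):
    """Return an encoded mail header as a plain str (hypothetical helper)."""
    if header is None:
        return None
    # decode_header splits the header into (value, charset) pairs;
    # make_header rebuilds a Header object that honours each charset.
    return str(make_header(decode_header(header)))

decoded = decode_mail_header('=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=')
print(decoded)  # Péter Petőcz
```

The same helper can be applied to email_message['From'], email_message['To'] and email_message['Subject'].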

Upvotes: 2

ottomeister

Reputation: 5828

That's called "quoted-printable" encoding. It was defined by RFC 1521 and is now specified in RFC 2045, which obsoleted it. Its purpose is to replace unusual byte values with a sequence of normal, safe characters so that the message can be handled safely by the email system.

In fact there are two levels of encoding here. First the letter 'é' was encoded into UTF-8, which produces the bytes '\xc3\xa9', and then those UTF-8 bytes were encoded into the quoted-printable form '=C3=A9'.

You can undo the quoted-printable step by using the decode or decodestring function of the quopri module, documented at https://docs.python.org/3/library/quopri.html. That will look something like:

    import quopri

    source = '=C3=A9'
    print(quopri.decodestring(source))

That will undo the quoted-printable encoding and show you the UTF-8 bytes, which Python 3 prints as b'\xc3\xa9'. To get back to the letter 'é' you need to call the decode method on those bytes and tell Python that they contain UTF-8 encoded text, something like:

    utf = quopri.decodestring(source)
    text = utf.decode('utf-8')
    print(text)

UTF-8 is only one of many possible ways of encoding letters into bytes. For example, if your 'é' had been encoded as ISO-8859-1 it would have had the byte value '\xe9' and its quoted-printable representation would have been '=E9'.
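
To make the difference between the two charsets concrete, here is a small sketch that runs both encoding levels in the forward direction and then reverses them. The variable names are illustrative only.

```python
import quopri

letter = 'é'
encoded = {}
for charset in ('utf-8', 'iso-8859-1'):
    raw = letter.encode(charset)         # level 1: letter -> bytes
    encoded[charset] = quopri.encodestring(raw)  # level 2: bytes -> quoted-printable
    # Reversing both levels with the same charset gives the letter back.
    assert quopri.decodestring(encoded[charset]).decode(charset) == letter
    print(charset, raw, encoded[charset])
```

Decoding with the wrong charset is where things go wrong: the ISO-8859-1 byte '\xe9' is not valid UTF-8, so calling decode('utf-8') on it raises UnicodeDecodeError.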

When you're dealing with email, you should see a Content-Type header that tells you what type of content is being sent and which letter-to-bytes encoding was applied to the text of the message (or to an individual MIME part, in a multipart message). If that text was then encoded again by applying the quoted-printable encoding, that additional step should be indicated by a Content-Transfer-Encoding header. So your message with UTF-8 encoded text carried in quoted-printable format should have had headers that look like this:

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
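
Rather than hard-coding 'utf-8' and calling quopri yourself, you can let the email package read those two headers. The following is a minimal sketch using a made-up one-part message; get_payload(decode=True) undoes the Content-Transfer-Encoding and get_content_charset() reports the declared charset. The fallback to UTF-8 when no charset is declared is an assumption, not something the headers guarantee.

```python
import email

# A tiny hand-written message carrying the headers shown above.
raw_mail = (
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'caf=C3=A9\r\n'
)

msg = email.message_from_bytes(raw_mail)
payload = msg.get_payload(decode=True)           # quoted-printable -> raw bytes
charset = msg.get_content_charset() or 'utf-8'   # assumed fallback
text = payload.decode(charset)
print(text.strip())  # café
```

For a multipart message the same two calls apply to each part yielded by walk().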

Upvotes: 3
