Reputation: 196
I have a lot of strings from mail bodies that print like this:
=C3=A9
This should be 'é', for example.
What exactly is this encoding, and how do I decode it?
I'm using Python 3.5.
EDIT:
I managed to get the body of the mail properly decoded by applying:
quopri.decodestring(sometext).decode('utf-8')
However, I still struggle to get the From, To, Subject, etc. headers right.
This is how I construct the e-mails:
import imaplib
import email
import quopri
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('[email protected]', '*******')
mail.list()
mail.select('"[Gmail]/All Mail"')
typ, data = mail.search(None, 'SUBJECT', '"{}"'.format('123456'))
data[0].split()
print(data[0].split())
for e_mail in data[0].split():
    typ, data = mail.fetch('{}'.format(e_mail.decode()),'(RFC822)')
    raw_mail = data[0][1]
    email_message = email.message_from_bytes(raw_mail)
    if email_message.is_multipart():
        for part in email_message.walk():
            if part.get_content_type() == 'text/plain':
                body = part.get_payload()
                to = email_message['To']
                utf = quopri.decodestring(to)
                text = utf.decode('utf-8')
                print(text)
.
.
.
I still get this: =?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=
Upvotes: 3
Views: 2413
Reputation: 196
This solved it:
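from email.header import decode_header

def mail_header_decoder(self, header):
    # Written as a method of a class, hence the self parameter.
    if header is not None:
        mail_header_decoded = decode_header(header)
        # decode_header returns (value, charset) pairs; if no pair has a
        # charset, nothing was encoded and the header can be returned as-is.
        if all(charset is None for _, charset in mail_header_decoded):
            return header
        header_new = []
        for value, charset in mail_header_decoded:
            # Decode bytes with the charset named in the header itself,
            # not always with the default UTF-8.
            if isinstance(value, bytes):
                value = value.decode(charset or 'ascii', errors='replace')
            header_new.append(value)
        return ''.join(header_new)  # convert list to string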
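A more compact variant, offered only as a sketch and not as the method used above: the standard library's make_header can reassemble the (value, charset) pairs returned by decode_header, and calling str() on the result gives the decoded text. The function name decode_mime_header is made up for this example.

```python
from email.header import decode_header, make_header


def decode_mime_header(value):
    """Return a header value with its encoded words turned into text."""
    if value is None:
        # A missing header (e.g. email_message['Cc']) stays None.
        return None
    # decode_header splits the value into (bytes_or_str, charset) pairs;
    # make_header rebuilds a Header object whose str() is the decoded text.
    return str(make_header(decode_header(value)))


# The header from the question decodes to the sender's name.
decoded = decode_mime_header('=?UTF-8?B?UMOpdGVyIFBldMWRY3o=?=')  # 'Péter Petőcz'

# A header without encoded words comes back unchanged.
plain = decode_mime_header('plain subject')
```

The same call works for From, To and Subject alike, since they all use the same =?charset?encoding?text?= form for non-ASCII text.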
Upvotes: 2
Reputation: 5828
That's called "quoted-printable" encoding. It was originally defined in RFC 1521 and is now specified by RFC 2045. Its purpose is to replace unusual byte values with sequences of plain, safe ASCII characters so that the message can pass through the email system intact.
In fact there are two levels of encoding here. First the letter 'é' was encoded into UTF-8, which produces '\xc3\xa9', and then that UTF-8 was encoded into the quoted-printable form '=C3=A9'.
You can undo the quoted-printable step by using the decode or decodestring function of the quopri module, documented at https://docs.python.org/3/library/quopri.html. That will look something like:
import quopri
source = '=C3=A9'
print(quopri.decodestring(source))
That will undo the quoted-printable encoding and show you the UTF-8 bytes '\xc3\xa9'. To get back to the letter 'é' you need to call the decode method of the resulting bytes object and tell Python that those bytes contain a UTF-8 encoding, something like:
utf = quopri.decodestring(source)
text = utf.decode('utf-8')
print(text)
UTF-8 is only one of many possible ways of encoding letters into bytes. For example, if your 'é' had been encoded as ISO-8859-1 it would have had the byte value '\xe9' and its quoted-printable representation would have been '=E9'.
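To make the difference concrete, here is a small sketch that decodes both quoted-printable forms of 'é' and shows what happens when the wrong charset is assumed. The variable names are made up for this example.

```python
import quopri

# The same letter under two different charsets, both quoted-printable.
utf8_bytes = quopri.decodestring(b'=C3=A9')    # b'\xc3\xa9'
latin1_bytes = quopri.decodestring(b'=E9')     # b'\xe9'

# Each byte string gives the letter back only with its own charset.
from_utf8 = utf8_bytes.decode('utf-8')
from_latin1 = latin1_bytes.decode('iso-8859-1')

# Decoding UTF-8 bytes as ISO-8859-1 does not fail, it silently yields
# two wrong characters instead of one right one.
mojibake = utf8_bytes.decode('iso-8859-1')
```

Seeing 'Ã©' in the output is therefore a strong hint that UTF-8 text was decoded with a single-byte charset.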
When you're dealing with email, you should see a Content-Type header that tells you what type of content is being sent and which letter-to-bytes encoding was applied to the text of the message (or to an individual MIME part, in a multipart message). If that text was then encoded again by applying the quoted-printable encoding, that additional step should be indicated by a Content-Transfer-Encoding header. So your message with UTF-8 encoded text carried in quoted-printable format should have had headers that look like this:
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
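When the message is parsed with the email package, those two headers can do the work for you. The following is a minimal sketch, assuming a single-part message built by hand for illustration: get_payload(decode=True) undoes the Content-Transfer-Encoding, and get_content_charset() reports the charset from Content-Type. The fallback to UTF-8 when no charset is declared is an assumption of this example, not a rule.

```python
import email

# A tiny hand-written message with the two headers discussed above.
raw = (
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'caf=C3=A9\r\n'
)
msg = email.message_from_bytes(raw)

# decode=True reverses quoted-printable (or base64) and returns bytes.
payload = msg.get_payload(decode=True)

# Use the declared charset; fall back to UTF-8 if none is given
# (an assumption made for this sketch).
charset = msg.get_content_charset() or 'utf-8'
text = payload.decode(charset)
```

In a multipart message the same two calls apply to each part returned by walk(), because every part carries its own Content-Type and Content-Transfer-Encoding.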
Upvotes: 3