Abhinav Anand

Reputation: 639

Parsing the HTML content in email

I'm trying to write a Python script to read my emails. I can extract most fields correctly, such as To, From, and Subject, but the body comes back as the plain text together with its HTML markup.

Below is the part of code that does the extraction of content from the email

email_message = email.message_from_string(raw_email)
print 'To:', email_message['To']
print 'Sent from:', email_message['From']
print 'Date:', email_message['Date']
print 'Subject:', email_message['Subject']
print '*'*30, 'MESSAGE', '*'*30
maintype = email_message.get_content_maintype()
#print maintype

if maintype == 'multipart':
    for part in email_message.get_payload():
        if part.get_content_maintype() == 'text':
            print part.get_payload()
elif maintype == 'text':
    print email_message.get_payload()
print '*'*69

Git link for the complete code: Email-parser

How to get rid of that HTML code and get only the plain text?

Upvotes: 2

Views: 16056

Answers (1)

mti2935

Reputation: 12027

The body of the message is MIME-encoded, most likely as a multipart/alternative message, which is why it contains the same text in both plain-text and HTML formats. To get just the plain text of the body, you first need to MIME-decode the message and then pick the text/plain part. You can use Python's email package to do the MIME decoding. Also, see this question for more information.

import email
import email.policy

with open("example.email", "rb") as f:
    msg = email.message_from_bytes(f.read(), policy=email.policy.default)

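# Print each direct subpart; get_content() decodes quoted-printable/base64 and the charset.
# Note: for a multipart/alternative body this prints both the plain-text and the HTML version.
for part in msg.iter_parts():
    print(part.get_content())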
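
If you only want the plain text, a minimal sketch along the following lines may help. It assumes Python 3.6+ and the modern `email.policy.default` API, where `get_body()` picks one alternative by preference. The helper names `plain_text_body` and `_TextExtractor` are made up for this example, and the HTML fallback is a rough tag stripper built on the standard library's `html.parser`, not a full HTML-to-text converter.

```python
import email
import email.policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def plain_text_body(raw_bytes):
    """Return the message body as plain text, or '' if there is no text body."""
    msg = email.message_from_bytes(raw_bytes, policy=email.policy.default)
    # Prefer text/plain; fall back to text/html only when no plain part exists.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    content = body.get_content()  # decoded str (transfer encoding + charset)
    if body.get_content_subtype() == "html":
        parser = _TextExtractor()
        parser.feed(content)
        parser.close()
        # Join the text nodes and collapse runs of whitespace (crude but simple).
        content = " ".join(" ".join(parser.chunks).split())
    return content
```

With the file from the snippet above, usage would be something like `print(plain_text_body(open("example.email", "rb").read()))`. Joining text nodes with a space can split a word that is partly wrapped in an inline tag, so for HTML-heavy mail a dedicated HTML-to-text library is probably the better choice.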

Upvotes: 5
