Lzypenguin

Reputation: 955

Issue scraping HTML from gmail

I am trying to scrape HTML from my Gmail email. I am using the email package and Beautiful Soup to get the data. For some reason, when I process the email that comes directly from the company that sends it to me, the HTML is returned like this:

PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy93M2MvL2R0ZCB4aHRtbCAxLjAgdHJhbnNpdGlvbmFs
Ly9lbiIgImh0dHA6Ly93d3cudzMub3JnL3RyL3hodG1sMS9kdGQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPjxodG1sIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hl
bHZldGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7
Ym94LXNpemluZzogYm9yZGVyLWJveCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGh0
bWwiPjxoZWFkIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hlbHZl
dGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7Ym94
LXNpemluZzogYm9yZGVyLWJveCI+CiAgICA8bWV0YSBzdHlsZT0ibWFyZ2luOiAwO3BhZGRpbmc6
IDA7Zm9udC1mYW1pbHk6ICdIZWx2ZXRpY2EgTmV1ZScsICdIZWx2ZXRpY2EnLCBIZWx2ZXRpY2Es
IEFyaWFsLCBzYW5zLXNlcmlmO2JveC1zaXppbmc6IGJvcmRlci1ib3giIGh0dHAtZXF1aXY9IkNv
bnRlbnQtVHlwZSIgY29udGVudD0idGV4dC9odG1sOyBjaGFyc2V0PVVURi04IiAvPgogICAgPHRp

This is the code I am running to get the data above.

def grab_email(most_recent):
    result2, email_data = mail.uid('fetch', most_recent, '(RFC822)')
    raw_email = email_data[0][1].decode('utf-8')
    email_message = email.message_from_string(raw_email)
    return email_message

def get_data(email_message):
    for part in email_message.walk():
        content_type = part.get_content_type()
        if 'html' in content_type:
            html_ = part.get_payload()
            soup = BeautifulSoup(html_, 'lxml')
            text = soup.get_text()
            print(text)

When the email comes from the original source, my code returns the block above, which looks like random numbers and letters. But if I forward the email to myself and run the code on the forwarded copy, it works perfectly and extracts the information exactly as it is supposed to. Any help figuring this out would be awesome!

Upvotes: 1

Views: 200

Answers (1)

Andrej Kesely

Reputation: 195508

The data you see is base64-encoded. To decode it, use the base64 module from the standard library:

import base64

txt = '''PCFET0NUWVBFIGh0bWwgUFVCTElDICItLy93M2MvL2R0ZCB4aHRtbCAxLjAgdHJhbnNpdGlvbmFs
Ly9lbiIgImh0dHA6Ly93d3cudzMub3JnL3RyL3hodG1sMS9kdGQveGh0bWwxLXRyYW5zaXRpb25h
bC5kdGQiPjxodG1sIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hl
bHZldGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7
Ym94LXNpemluZzogYm9yZGVyLWJveCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGh0
bWwiPjxoZWFkIHN0eWxlPSJtYXJnaW46IDA7cGFkZGluZzogMDtmb250LWZhbWlseTogJ0hlbHZl
dGljYSBOZXVlJywgJ0hlbHZldGljYScsIEhlbHZldGljYSwgQXJpYWwsIHNhbnMtc2VyaWY7Ym94
LXNpemluZzogYm9yZGVyLWJveCI+CiAgICA8bWV0YSBzdHlsZT0ibWFyZ2luOiAwO3BhZGRpbmc6
IDA7Zm9udC1mYW1pbHk6ICdIZWx2ZXRpY2EgTmV1ZScsICdIZWx2ZXRpY2EnLCBIZWx2ZXRpY2Es
IEFyaWFsLCBzYW5zLXNlcmlmO2JveC1zaXppbmc6IGJvcmRlci1ib3giIGh0dHAtZXF1aXY9IkNv
bnRlbnQtVHlwZSIgY29udGVudD0idGV4dC9odG1sOyBjaGFyc2V0PVVURi04IiAvPgogICAgPHRp'''


print(base64.b64decode(txt))

Prints:

b'<!DOCTYPE html PUBLIC "-//w3c//dtd xhtml 1.0 transitional//en" "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd"><html style="margin: 0;padding: 0;font-family: \'Helvetica Neue\', \'Helvetica\', Helvetica, Arial, sans-serif;box-sizing: border-box" xmlns="http://www.w3.org/1999/xhtml"><head style="margin: 0;padding: 0;font-family: \'Helvetica Neue\', \'Helvetica\', Helvetica, Arial, sans-serif;box-sizing: border-box">\n    <meta style="margin: 0;padding: 0;font-family: \'Helvetica Neue\', \'Helvetica\', Helvetica, Arial, sans-serif;box-sizing: border-box" http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n    <ti'
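
As an alternative to calling base64 by hand, the email package can usually undo the transfer encoding itself: get_payload(decode=True) decodes the part according to its Content-Transfer-Encoding header and returns bytes. The sketch below is a minimal illustration, not the asker's exact setup; the extract_html helper and the sample message are hypothetical, and the UTF-8 fallback charset is an assumption.

```python
import email
from email.mime.text import MIMEText


def extract_html(email_message):
    """Return the decoded text of every text/html part (hypothetical helper)."""
    html_parts = []
    for part in email_message.walk():
        if part.get_content_type() == "text/html":
            # decode=True reverses base64 / quoted-printable and returns bytes
            payload = part.get_payload(decode=True)
            # Fall back to UTF-8 when the part declares no charset (assumption)
            charset = part.get_content_charset() or "utf-8"
            html_parts.append(payload.decode(charset, errors="replace"))
    return html_parts


# Hypothetical sample: MIMEText with a utf-8 charset base64-encodes its body,
# which mimics the kind of message described in the question.
sample = MIMEText("<html><body><p>Hello</p></body></html>", "html", "utf-8")
message = email.message_from_string(sample.as_string())

print(message["Content-Transfer-Encoding"])
print(extract_html(message))
```

Each decoded string can then be passed to BeautifulSoup the same way html_ is in the question's get_data function.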

Upvotes: 1
