Obie

Reputation: 111

BeautifulSoup: regex does not remove \r and \n from output

I just ran the following Python code to pull certain emails out of an IMAP folder. The extraction part works fine and the BeautifulSoup part works okay, but the output contains a lot of '\r' and '\n' sequences.

I tried to remove these with the regex sub function, but it's not working, and it doesn't even give an error message. Any idea what is wrong? I am attaching the code. Please note that this is not the complete code, but everything above what I'm posting works okay. It still prints the output, and it's "prettified", but the \r and \n are still there. I have also tried find_all(), but that doesn't work either.

mail.list()  # Lists all labels in GMail
mail.select('INBOX/Personal')  # Connected to inbox.

resp, items = mail.search(None, '(SEEN)')

items = items[0].split()  # get the message IDs
for emailid in items:
    # getting the mail content
    resp, data = mail.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0])  # [1] don't forget to add this back
    soup = bs(text, 'html.parser')
    soup = soup.prettify()
    soup = re.sub('\\r\\n', '', soup)

print(soup)

Upvotes: 7

Views: 447

Answers (2)

silgon

Reputation: 7221

What about using str.replace directly? Since it does not involve a regex, it should be faster.

soup = soup.replace("\n", "").replace("\r", "")  # str.replace returns a new string, so assign it back
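
A minimal sketch of this approach, assuming `soup` is already a plain `str` (as it is after `prettify()`). The sample value is made up for illustration and is not the asker's data.

```python
# Hypothetical sample standing in for the prettified output.
soup = "<p>\r\n Hello\r\n</p>\r\n"

# str.replace does not modify the string in place; it returns a new one.
cleaned = soup.replace("\r", "").replace("\n", "")

print(cleaned)
```

Note that this only removes real carriage-return and line-feed characters. It has no effect on a literal backslash followed by the letter r or n.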

Upvotes: 2

MasOOd.KamYab

Reputation: 984

You can do this with a single regex statement:

soup = re.sub('[\\r\\n]+', '', soup)  # a bare n* here would also delete every letter "n"

Or you can use two separate calls:

soup = re.sub('\\r', '', soup)
soup = re.sub('\\n', '', soup)

https://regexr.com/3nnp1
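
One possible reason the substitution in the question appears to do nothing: `str()` applied to the bytes (or tuple of bytes) returned by `mail.fetch()` produces the repr, in which each line ending shows up as a literal backslash followed by `r` or `n` rather than as real control characters. A pattern that targets real CR/LF then has nothing to match. The sketch below illustrates the difference; the payload is hypothetical, and whether this is what happens in the asker's case is an assumption based on the `str(data[0])` line.

```python
import re

# Hypothetical payload standing in for the bytes returned by mail.fetch().
raw = b"<p>Hello</p>\r\n<p>World</p>\r\n"

# str() on bytes gives the repr: backslash + "r" and backslash + "n" as literal characters.
as_repr = str(raw)

# Decoding gives real carriage-return and line-feed characters.
decoded = raw.decode("utf-8")

pattern = r"[\r\n]+"  # matches real CR/LF characters only

unchanged = re.sub(pattern, "", as_repr)  # nothing to match in the repr
cleaned = re.sub(pattern, "", decoded)    # line endings removed

# To strip the literal two-character sequences, escape the backslash itself.
literal_cleaned = re.sub(r"\\r|\\n", "", as_repr)

print(cleaned)
print(literal_cleaned)
```

Decoding the payload before passing it to BeautifulSoup is usually the cleaner route, since it also avoids the stray `b'...'` wrapper that the repr leaves behind.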

Upvotes: 4
