Unexpected result when parsing with BeautifulSoup and regex

Question

I am playing around with the BeautifulSoup library. I was trying to extract an email address from a web page, but got an unexpected result. This is my code:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

from bs4 import BeautifulSoup
import re
from urllib.parse import quote 

startUrl = "http://getrocketbook.com/pages/returns"
try:
    html = urlopen(quote((startUrl).encode('utf8'), ':/?%#_'))
    bsObj = BeautifulSoup(html, "html.parser")
    alls = bsObj.body.findAll(text=re.compile(r'[A-Za-z0-9\._+-]+@[A-Za-z0-9\.-]+'))
    for al in alls:
        print(al)
except HTTPError:
    pass
except URLError:
    pass

I expected to get just the email address, but I got this whole sentence instead:

If you’ve done all of this and you still have not received your refund yet, please contact us at hello@getrocketbook.com.

Any idea what I could be doing wrong?

alecxe · Accepted Answer

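This is because findAll() matches and returns whole elements or text nodes, not the individual substring that matched the pattern. With text=..., you get back every text node that contains a match, which here is the entire sentence.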

What you need to do is apply the same compiled regular expression to each text node that is returned:

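# Raw string, so the backslashes reach the regex engine untouched
pattern = re.compile(r'[A-Za-z0-9\._+-]+@[A-Za-z0-9\.-]+')
alls = bsObj.body.find_all(text=pattern)  # whole text nodes containing a match
for al in alls:
    # search() locates the address inside the node; group(0) is the matched text
    print(pattern.search(al).group(0))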
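
For anyone who wants to try this without hitting the network, here is a minimal self-contained sketch of the same idea. The SAMPLE_HTML string and the extract_emails() helper are made up for illustration and are not from the original page or code. It uses string=, the newer spelling of the text= argument in recent Beautiful Soup 4 releases. Note that the domain part of the pattern allows ".", so a sentence-ending period is captured as part of the match; the sketch strips it, which is an assumption about what you want.

```python
import re

from bs4 import BeautifulSoup

# Same character classes as the pattern in the question, as a raw string.
EMAIL_RE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9.-]+")

# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<p>Refunds are processed within a few days.</p>
<p>If you have not received your refund yet, please contact us at hello@getrocketbook.com.</p>
</body></html>
"""


def extract_emails(html):
    """Return every address found in the text nodes of the document body."""
    soup = BeautifulSoup(html, "html.parser")
    emails = []
    # find_all(string=...) yields the whole text nodes that contain a match.
    for node in soup.body.find_all(string=EMAIL_RE):
        # finditer() pulls out each matching substring within that node.
        for match in EMAIL_RE.finditer(node):
            # The domain class allows ".", so a trailing full stop is
            # captured too; strip it (an assumption about the desired result).
            emails.append(match.group(0).rstrip("."))
    return emails


if __name__ == "__main__":
    for email in extract_emails(SAMPLE_HTML):
        print(email)
```

Using finditer() rather than search() also covers the case where one text node contains more than one address.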

Also, since there is only a single email address on the page, consider using the find() method instead, which returns just the first match.
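
A rough sketch of the find() variant, assuming you only care about the first address. The first_email() helper is a hypothetical name used for illustration. find() returns None when nothing matches, so that case is checked before calling search().

```python
import re

from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9.-]+")


def first_email(html):
    """Return the first address in the document, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching text node, or None.
    node = soup.find(string=EMAIL_RE)
    if node is None:
        return None
    # Extract the address from the node; drop a sentence-ending period
    # that the domain character class would otherwise include.
    return EMAIL_RE.search(node).group(0).rstrip(".")


if __name__ == "__main__":
    print(first_email("<p>Please contact us at hello@getrocketbook.com.</p>"))
```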
