Reputation: 6014
I am playing around with the BeautifulSoup library. I was trying to extract an email address from a web page, but got an unexpected result. This is my code:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import re
from urllib.parse import quote

startUrl = "http://getrocketbook.com/pages/returns"
try:
    html = urlopen(quote((startUrl).encode('utf8'), ':/?%#_'))
    bsObj = BeautifulSoup(html, "html.parser")
    alls = bsObj.body.findAll(text=re.compile('[A-Za-z0-9\._+-]+@[A-Za-z0-9\.-]+'))
    for al in alls:
        print(al)
except HTTPError:
    pass
except URLError:
    pass
I expected to get just the email address, but I got this whole sentence instead:
If you’ve done all of this and you still have not received your refund yet, please contact us at [email protected].
Any idea what I could be doing wrong?
Upvotes: 2
Views: 42
Reputation: 473873
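This is because findAll() matches whole elements or text nodes, not individual words. The regular expression only decides which text nodes are returned; each result is still the complete text node.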
What you need to do is to apply the same compiled regular expression to the result:
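# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'[A-Za-z0-9\._+-]+@[A-Za-z0-9\.-]+')
alls = bsObj.body.find_all(text=pattern)
for al in alls:
    # Pull only the matching part out of the text node
    print(pattern.search(al).group(0))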
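If a text node might hold more than one address, something along these lines could work. This is only a sketch: `extract_emails` and the sample strings are made up for illustration and stand in for the text nodes returned by the search. Note that the domain part of the pattern accepts dots, so a sentence-ending period gets captured along with the address; the sketch strips it.

```python
import re

# Same pattern as above, written as a raw string.
EMAIL_PATTERN = re.compile(r'[A-Za-z0-9._+-]+@[A-Za-z0-9.-]+')


def extract_emails(text_nodes):
    """Return every address-like substring found in the given strings."""
    emails = []
    for node in text_nodes:
        # findall() returns all non-overlapping matches in the string,
        # whereas search() stops at the first one.
        for match in EMAIL_PATTERN.findall(str(node)):
            # The domain class allows '.', so drop a trailing full stop.
            emails.append(match.rstrip('.'))
    return emails


# Hypothetical text nodes standing in for the find_all() result.
sample_nodes = [
    "Please contact us at support@example.com.",
    "Sales: sales@example.org, billing: billing@example.org",
]
print(extract_emails(sample_nodes))
```

Since BeautifulSoup text nodes are string subclasses, the same helper should accept the `alls` list directly.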
Also, since there is a single email there, see if you can use the find() method instead.
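A rough sketch of the find() variant, run against an inline HTML string rather than the live page. The markup and the address are invented for the example, and `string=` is the newer name for the `text=` argument used above.

```python
import re
from bs4 import BeautifulSoup

pattern = re.compile(r'[A-Za-z0-9._+-]+@[A-Za-z0-9.-]+')

# Hypothetical markup standing in for the downloaded page.
html = """
<html><body>
  <p>Returns are accepted within 30 days.</p>
  <p>Please contact us at support@example.com for a refund.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching text node, or None if nothing matches.
node = soup.body.find(string=pattern)
email = pattern.search(node).group(0) if node is not None else None
print(email)
```

Checking for None before calling search() avoids an exception on pages that contain no address at all.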
Upvotes: 4