Reputation: 113
I wrote some code to extract email addresses from websites:
import requests
from bs4 import BeautifulSoup
import re
url = ""
s = requests.Session()
r = s.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
soup = BeautifulSoup(r.content, 'html.parser')
content = soup.get_text()
emails_match = re.findall(r'[\w\.-]+@[\w\.-]+', content)
It works fine, but it sometimes returns emails with text from other elements attached. For example, if we run the code on this website: https://alliedsinterings.com/ it returns a phone number plus the email:
print(emails_match)
['[email protected]']
I want to get only the email address (without any text from other HTML elements).
When I try another regex, it returns the same result, for example:
r'([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+){0,}'
Upvotes: 0
Views: 774
Reputation: 8302
Use .strings instead of .get_text() (or .text), so that each text node is searched on its own:
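import re
# Match a single email address (the pattern must not be able to match an empty string)
email = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')
# soup is the BeautifulSoup object from the question; each text node is searched separately
[m for x in soup.strings for m in email.findall(x)]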
['[email protected]']
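To see why this helps, here is a minimal, self-contained sketch. The HTML snippet, the phone number, and the address sales@example.com are made up for illustration and are not taken from the site in the question. It assumes the phone number and the email sit in adjacent elements with no whitespace between them, which is one likely cause of the joined result.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a phone number and an email in adjacent elements
html = "<div><span>555-0100</span><a href='#'>sales@example.com</a></div>"
soup = BeautifulSoup(html, "html.parser")

EMAIL_RE = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+")

# get_text() joins the text nodes with no separator, so the regex
# sees "555-0100sales@example.com" as one unbroken token
joined = EMAIL_RE.findall(soup.get_text())

# .strings yields each text node on its own, so nothing is glued together
per_node = [m for s in soup.strings for m in EMAIL_RE.findall(s)]

print(joined)    # ['555-0100sales@example.com']
print(per_node)  # ['sales@example.com']
```

If the phone number and the email share a single text node, .strings alone will not separate them, and the regex itself would need tightening.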
Upvotes: 1