Extract valid email addresses using regular expressions and BeautifulSoup

Question

I wrote some code to extract emails from websites:

import requests
from bs4 import BeautifulSoup
import re

url = ""
s = requests.Session()
r = s.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
soup = BeautifulSoup(r.content, 'html.parser')

content = soup.get_text()
emails_match = re.findall(r'[\w\.-]+@[\w\.-]+', content)

It works fine, but sometimes it returns emails with text from other elements attached. For example, running the code on https://alliedsinterings.com/ returns a phone number fused to the email:

print(emails_match)
['743-2538info@alliedsinterings.com']

I want to get only the email address, without any text from other HTML elements.

When I try another regex, it returns the same result. For example:

r'([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+){0,}'

sushanth · Accepted Answer

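Use .strings instead of .text (or get_text()). By default, get_text() joins every text node into one string with no separator, so the phone number and the email end up next to each other and the regex matches across both. .strings yields each text node on its own.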

import re

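# Pattern taken from the question; soup is the BeautifulSoup object built there.
email = re.compile(r'([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+){0,}')

# .strings yields each text node separately, so text from neighbouring elements is not merged in.
[x for x in soup.strings if email.search(x).group()]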

['info@alliedsinterings.com']
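
One caveat with the one-liner above: because of the trailing {0,}, the pattern can also match an empty string, so search() only reports an address when the text node starts with it, and the list holds the whole text node rather than just the matched address. Below is a minimal sketch of a stricter variant, not a definitive implementation. The name extract_emails, the EMAIL_RE pattern, and the sample fragments are assumptions made for illustration. The fragments list stands in for soup.strings so that the example runs without fetching a page.

```python
import re

# Assumed pattern for illustration only: deliberately simple,
# not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(strings):
    """Return the unique email addresses found in an iterable of text fragments.

    Each fragment is searched on its own, so text from a neighbouring
    element is never glued onto an address.
    """
    found = []
    for fragment in strings:
        for address in EMAIL_RE.findall(fragment):
            if address not in found:  # keep first-seen order, drop repeats
                found.append(address)
    return found


# Hypothetical stand-in for soup.strings: one fragment per HTML text node.
fragments = [
    "Call us: 743-2538",
    "info@alliedsinterings.com",
    "Email: sales@example.com today",
]
print(extract_emails(fragments))
# ['info@alliedsinterings.com', 'sales@example.com']
```

With the soup object from the question, the call would be extract_emails(soup.strings). Another option, if you prefer to keep get_text(), is to pass a separator such as soup.get_text(" ") so that adjacent text nodes are separated by a space before the regex runs.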
