max scender

Reputation: 113

Extract valid email addresses using a regular expression and BeautifulSoup

I wrote some code to extract email addresses from websites:

import requests
from bs4 import BeautifulSoup
import re

url = ""
s = requests.Session()
r = s.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
soup = BeautifulSoup(r.content, 'html.parser')

content = soup.get_text()
emails_match = re.findall(r'[\w\.-]+@[\w\.-]+', content)

It works fine, but sometimes it returns an email with text from a neighbouring element attached. For example, on https://alliedsinterings.com/ it returns the phone number joined to the email:

print(emails_match)
['[email protected]']

I want to get only the email address, without any text from other HTML elements.

When I try another regex, it returns the same result, for example:

r'([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+){0,}'

Upvotes: 0

Views: 774

Answers (1)

sushanth

Reputation: 8302

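Use .strings instead of .get_text() (or .text). get_text() concatenates the text of adjacent elements with no separator, so a phone number in one element runs straight into the email in the next one. .strings yields each text node separately.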

import re

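# No {0,} quantifier: with it, search() can succeed on an empty match at position 0
email = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')

# Search each text node separately and keep only the matched address
[m.group() for x in soup.strings for m in email.finditer(x)]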

['[email protected]']
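To see why iterating over .strings helps, here is a minimal sketch on made-up markup. The HTML snippet, the phone number, and the info@example.com address are assumptions for illustration only, not taken from the site in the question. It compares get_text() with .strings, and also shows get_text() with a separator as a possible alternative.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a phone number and an email in adjacent elements
html = "<div><span>555-0100</span><a href='#'>info@example.com</a></div>"
soup = BeautifulSoup(html, "html.parser")

EMAIL = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+")

# get_text() joins adjacent text nodes with no separator, so the phone
# number becomes part of what the regex sees as the local part
glued = EMAIL.findall(soup.get_text())

# .strings yields each text node on its own, so nothing bleeds across elements
separate = [m for s in soup.strings for m in EMAIL.findall(s)]

# Alternative: ask get_text() to put a space between text nodes
spaced = EMAIL.findall(soup.get_text(" ", strip=True))

print(glued)
print(separate)
print(spaced)
```

Either approach only helps when the unwanted text lives in a different element. If the phone number and the email share one text node with no whitespace between them, the regex itself would need tightening.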

Upvotes: 1
