user11322373

Safe regex to find emails in HTML

import re
import requests

url = "https://www.siliconvalleypediatricdentistry.com/"
res = requests.get(url)
html = res.text
# re.findall(r'([\w0-9._-]+@[\w0-9._-]+\.[\w0-9_-]+)', html)
# re.findall(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", html)

I found plenty of questions about this, but most of the answers extract "wrong" emails.

This is the output I am getting:

['[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]']

Some of them are just JS scripts. Is there a safer regex to use, or a module that does this?

Upvotes: 0

Views: 206

Answers (2)

Zorzi

Reputation: 792

This works for me:

re.findall(r'([\w-]+@[\w-]+\.[a-zA-Z]{1,5})',html)

Basically, we force the ending to be letters only (e.g. .com), so the JS scripts don't match.
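To illustrate, here is a minimal sketch of that pattern run against a made-up HTML snippet. The sample markup, the `cdn.example.net` URLs and the `example.com` address are assumptions for illustration only, not taken from the page in the question.

```python
import re

# The pattern from this answer, compiled once for reuse.
EMAIL_RE = re.compile(r'([\w-]+@[\w-]+\.[a-zA-Z]{1,5})')

# Hypothetical HTML: one real-looking address plus two versioned script URLs.
sample_html = '''
<a href="mailto:info@example.com">Contact</a>
<script src="https://cdn.example.net/npm/jquery@3.4.1/dist/jquery.min.js"></script>
<script src="https://cdn.example.net/npm/bootstrap@4.3.1/js/bootstrap.js"></script>
'''

found = EMAIL_RE.findall(sample_html)
print(found)  # ['info@example.com']

# Known limitations of this pattern:
print(EMAIL_RE.findall('logo@2x.png'))             # ['logo@2x.png'] (still a false positive)
print(EMAIL_RE.findall('first.last@example.com'))  # ['last@example.com'] (local part is cut at the dot)
```

The version strings are skipped because a digit follows the dot, but file names such as `logo@2x.png` still match, and dotted local parts get truncated, so the results probably need a second filtering pass.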

Upvotes: 1

nikoola

Reputation: 151

You can try this:

r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,6})+$'
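Note that this pattern is anchored with ^ and $, so it checks one whole string at a time rather than searching inside a page. A small sketch of that behaviour follows; the helper name `is_email` and the sample strings are made up for illustration.

```python
import re

# The anchored pattern from this answer.
PATTERN = re.compile(r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,6})+$')

def is_email(candidate):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return PATTERN.match(candidate) is not None

print(is_email('info@example.com'))  # True
print(is_email('jquery@3.4.1'))      # False: the last label has only one character

# Run directly on markup, the anchors prevent any match.
print(PATTERN.findall('<a href="mailto:info@example.com">'))  # []
```

So it is better suited to validating candidates one by one than to being passed straight to re.findall on the full HTML.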

Or you can use your own regex and then check whether each email address is valid with:

from validate_email import validate_email
is_valid = validate_email('[email protected]')
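The two-step idea (extract loosely, then validate each candidate) could be sketched roughly like this. To keep the sketch dependency-free, `looks_like_email` is a hypothetical stand-in for a real validator such as validate_email, and `CANDIDATE_RE` and the sample text are assumptions as well.

```python
import re

# Loose, unanchored pattern used only to collect candidates (an assumption).
CANDIDATE_RE = re.compile(r'[\w.+-]+@[\w.-]+')

def looks_like_email(candidate):
    """Hypothetical stand-in for a real validator such as validate_email."""
    local, _, domain = candidate.partition('@')
    labels = domain.split('.')
    return (
        bool(local)
        and len(labels) >= 2
        and all(labels)              # no empty labels such as "a..b"
        and labels[-1].isalpha()     # last label must be letters, so "3.4.1" fails
        and len(labels[-1]) >= 2
    )

def extract_emails(text):
    """Collect unique candidates that pass the validity check, in order."""
    emails = []
    for candidate in CANDIDATE_RE.findall(text):
        candidate = candidate.strip('.')  # drop trailing sentence punctuation
        if looks_like_email(candidate) and candidate not in emails:
            emails.append(candidate)
    return emails

sample = ('Write to first.last@example.com. '
          '<script src="/npm/jquery@3.4.1/x.js"></script> '
          'info@example.co.uk info@example.co.uk')
print(extract_emails(sample))  # ['first.last@example.com', 'info@example.co.uk']
```

Swapping `looks_like_email` for a stricter check should only require changing that one function; file names like `logo@2x.png` would still pass this simple version.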

Upvotes: 1
