Reputation:
import re
import requests

url = "https://www.siliconvalleypediatricdentistry.com/"
res = requests.get(url)
html = res.text
#re.findall(r'([\w0-9._-]+@[\w0-9._-]+\.[\w0-9_-]+)',html)
#re.findall(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)",html)
I found plenty of questions about this, but most of the answers extract "wrong" emails.
This is the output I am getting:
['[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]']
Some of them are just JS scripts. Is there a safer regex to use, or a module that does this?
Upvotes: 0
Views: 206
Reputation: 792
That works for me:
re.findall(r'([\w-]+@[\w-]+\.[a-zA-Z]{1,5})',html)
Basically, we just force the ending to be letters only (e.g. .com), so the JS scripts don't match.
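As a rough illustration of that idea, here is a minimal sketch run against a made-up HTML snippet. The sample markup, the `EMAIL_RE` name and the `find_emails` helper are assumptions for the example, not something taken from the site in the question. The capturing parentheses are dropped because `findall` returns whole matches when the pattern has no group.

```python
import re

# Hypothetical HTML, invented for this example.
sample_html = '''
<a href="mailto:info@example.com">Email us</a>
<script src="https://cdn.example.net/npm/jquery@3.4.1/dist/jquery.min.js"></script>
<img src="/img/logo@2x.png">
'''

# Same pattern as above: the part after the last dot must be 1 to 5 letters.
EMAIL_RE = re.compile(r'[\w-]+@[\w-]+\.[a-zA-Z]{1,5}')

def find_emails(html):
    """Return the unique matches in sorted order."""
    return sorted(set(EMAIL_RE.findall(html)))

print(find_emails(sample_html))
# ['info@example.com', 'logo@2x.png']
```

The versioned script reference (jquery@3.4.1) is rejected because the text after the dot is a digit. A file name such as logo@2x.png still gets through, so some extra filtering may be needed depending on the page.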
Upvotes: 1
Reputation: 151
You can try this:
r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,6})+$'
Or you can use your own regex and then check whether each email address is valid with:
from validate_email import validate_email
is_valid = validate_email('[email protected]')
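Note that the regex above is anchored with ^ and $, so it only matches when the whole string is an address. It suits checking one candidate at a time rather than calling `findall` on a full page. A minimal standard-library sketch of that two-step approach follows. The loose `CANDIDATE_RE` pattern, the `extract_valid_emails` helper and the sample text are assumptions made up for illustration.

```python
import re

# Loose pattern used only to pull candidates out of the page (an assumption).
CANDIDATE_RE = re.compile(r'[\w.-]+@[\w.-]+\.[\w-]+')

# Anchored pattern from this answer, applied to one candidate at a time.
VALID_RE = re.compile(r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,6})+$')

def extract_valid_emails(html):
    """Find loose candidates, then keep those the anchored pattern accepts."""
    candidates = set(CANDIDATE_RE.findall(html))
    return sorted(c for c in candidates if VALID_RE.match(c))

# Hypothetical input, invented for this example.
sample_html = 'Contact <b>info@example.com</b> or load jquery@3.4.1 from the CDN.'
print(extract_valid_emails(sample_html))
# ['info@example.com']
```

Here jquery@3.4.1 is picked up as a candidate but dropped in the second step, because the final label must be 2 to 6 word characters long.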
Upvotes: 1