Safe regex to find emails in HTML

Question

import re
import requests

url = "https://www.siliconvalleypediatricdentistry.com/"
res = requests.get(url)
html = res.text
# Patterns I have tried:
# re.findall(r'([\w0-9._-]+@[\w0-9._-]+\.[\w0-9_-]+)', html)
# re.findall(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", html)

I found plenty of questions about this, but most of the suggested patterns also extract "wrong" emails.

This is the output I am getting:

['8b4e078a51d04e0e9efdf470027f0ec1@sentry.wixpress.com',
 'core-js-bundle@3.2.1',
 'whatwg-fetch@3.0.0',
 'requirejs-bolt@2.3.6',
 'svpdinfo@gmail.com',
 'svpdinfo@gmail.com',
 'SVPDinfo@gmail.com']

Some of these are not emails at all, just JS package names with version numbers (e.g. core-js-bundle@3.2.1). Is there a safer regex to use, or a module that does this?

Zorzi · Accepted Answer

This works for me:

re.findall(r'([\w-]+@[\w-]+\.[a-zA-Z]{1,5})',html)

Basically, we just force the part after the dot to be letters (e.g. .com), so JS package strings such as core-js-bundle@3.2.1 don't match.
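To see the effect without hitting the site, here is a minimal sketch that runs the pattern on a made-up HTML snippet modeled on the strings from the question's output. The script paths in the sample, the names ANSWER_PATTERN, VARIANT_PATTERN and find_emails, and the variant pattern itself are my own assumptions, not part of the answer. The variant is one possible loosening that keeps dots in the local part and multi-label domains, and the helper drops case-insensitive duplicates.

```python
import re

# Hypothetical sample modeled on the strings in the question's output
# (the script paths are invented for illustration).
html = """
<script src="core-js-bundle@3.2.1/minified.js"></script>
<script src="whatwg-fetch@3.0.0/dist/fetch.umd.js"></script>
<script src="requirejs-bolt@2.3.6/requirejs.min.js"></script>
<a href="mailto:svpdinfo@gmail.com">SVPDinfo@gmail.com</a>
"""

# Pattern from the answer: the part after the dot must be 1-5 letters.
ANSWER_PATTERN = re.compile(r'([\w-]+@[\w-]+\.[a-zA-Z]{1,5})')

# Assumed variant (not from the answer): allows dots and plus signs in the
# local part, several domain labels, and a final label of 2+ letters.
VARIANT_PATTERN = re.compile(r'[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}')


def find_emails(text, pattern=VARIANT_PATTERN):
    """Return matches in first-seen order, dropping case-insensitive duplicates."""
    seen = set()
    result = []
    for match in pattern.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


print(ANSWER_PATTERN.findall(html))
# ['svpdinfo@gmail.com', 'SVPDinfo@gmail.com']
print(find_emails(html))
# ['svpdinfo@gmail.com']

# The answer's pattern cuts off longer addresses:
print(ANSWER_PATTERN.findall("first.last@example.co.uk"))
# ['last@example.co']
print(find_emails("first.last@example.co.uk"))
# ['first.last@example.co.uk']
```

Neither pattern is a real validator. The answer's pattern is stricter about what follows the @, but it truncates addresses with a dotted local part or more than one dot in the domain. The variant keeps those intact, but it can also pick up a string like pkg@3.2.1.min.js, and machine-generated addresses such as the sentry.wixpress.com one still match and would need separate filtering (for example, by domain).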
