Reputation: 425
I am trying to improve this regex so that it only returns emails that do not end with ".jpg", and so that any -- on the left or right side of an email is removed. The source parameter is a string, for example:
<html>
<body>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
</body>
</html>
The result should contain: [email protected], [email protected], [email protected]
So basically, I would like to improve this function so that the regex produces emails without the --, and if possible improve the chain of if not email[0].endswith('.png') checks; if I want to add more extensions, this could look ugly.
import re

def extract_emails(source):
    regex = re.compile(r'([\w\-\.]{1,100}@(\w[\w\-]+\.)+[\w\-]+)')
    # findall returns one tuple per match (two groups), so email[0] is the full address
    emails = list(set(regex.findall(source.decode("utf8"))))
    all_emails = []
    for email in emails:
        if not email[0].endswith('.png') and not email[0].endswith('.jpg') \
                and not email[0].endswith('.gif') and not email[0].endswith('.rar') \
                and not email[0].endswith('.zip') and not email[0].endswith('.swf'):
            all_emails.append(email[0].lower())
    return list(set(all_emails))
Upvotes: 2
Views: 342
Reputation: 5302
If you only expect a handful of top-level domains, you can list them with alternation:
import re
s="""<html>
<body>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
</body>
</html>"""
# The alternation is wrapped in (?:...) so that it applies only to the TLD
print(re.findall(r"-*([\w\.]{1,100}@\w[\w\-]+\.(?:com|biz|us|bd))-*", s))
['[email protected]', '[email protected]', '[email protected]']
Or try a negative lookahead: \w+@\w+\.(?!jpg|png)\w+\.*\w*
s="""<html>
<body>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
</body>
</html>"""
print(re.findall(r"\w+@\w+\.(?!jpg|png)\w+\.*\w*", s))
It is very hard to write a single fixed regex for email validation. For details, see the question "Using a regular expression to validate an email address", which has 69 answers.
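As a rough sketch of the lookahead idea, the blocked extensions can be kept in one tuple and joined into the pattern, so adding another extension means editing a single line. This assumes Python 3 and a str input; the addresses in sample and the names BLOCKED_EXTENSIONS, EMAIL_RE and this version of extract_emails are made up for illustration, not taken from the question.

```python
import re

# Hypothetical list of extensions to reject; extend as needed.
BLOCKED_EXTENSIONS = ("jpg", "png", "gif", "rar", "zip", "swf")

blocked = "|".join(map(re.escape, BLOCKED_EXTENSIONS))
EMAIL_RE = re.compile(
    r"\w[\w.-]*"             # local part must start with a word char, so leading dashes are skipped
    r"@(?:\w[\w-]*\.)+"      # one or more domain labels, each followed by a dot
    r"(?!(?:%s)\b)" % blocked +  # the last label must not be a blocked extension
    r"\w+"                   # last label (no dashes, so trailing dashes are left out)
    r"(?!\.?\w)",            # do not stop in the middle of a longer name
    re.IGNORECASE,
)

def extract_emails(source):
    """Return the unique, lowercased addresses found in source, sorted."""
    return sorted({m.group(0).lower() for m in EMAIL_RE.finditer(source)})

# Made-up sample input (illustrative addresses only).
sample = """<html>
<body>
<p>alice@example.com</p>
<p>--bob@example.biz--</p>
<p>sprite@2x.jpg</p>
<p>Carol@mail.example.us</p>
</body>
</html>"""

print(extract_emails(sample))
# ['alice@example.com', 'bob@example.biz', 'carol@mail.example.us']
```

The trailing (?!\.?\w) is there so that something like a@b.c.jpg is rejected as a whole instead of being cut down to a@b.c.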
Upvotes: 2
Reputation: 61293
The best way to do this is to use an HTML parser like BeautifulSoup:
In [37]: from bs4 import BeautifulSoup
In [38]: soup = BeautifulSoup('''<html>
....: <body>
....: <p>[email protected]</p>
....: <p>[email protected]</p>
....: <p>[email protected]</p>
....: <p>[email protected]</p>
....:
....: </body>
....: </html>''', 'lxml')
In [39]: [email.strip('-') for email in soup.stripped_strings if not email.endswith('.jpg')]
Out[39]: ['[email protected]', '[email protected]', '[email protected]']
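If more extensions need to be excluded, note that str.endswith also accepts a tuple of suffixes, so the filter stays a single call. Below is a small sketch of that idea on a plain list of strings standing in for soup.stripped_strings; the addresses and the names BLOCKED_SUFFIXES and clean_candidates are hypothetical, and the "@" check is an extra assumption about what counts as a candidate.

```python
# Hypothetical suffixes to reject; extend as needed.
BLOCKED_SUFFIXES = (".jpg", ".png", ".gif", ".rar", ".zip", ".swf")

def clean_candidates(candidates):
    """Strip surrounding dashes, lowercase, and drop file-like strings."""
    cleaned = []
    for text in candidates:
        email = text.strip("-").lower()
        # endswith with a tuple is true if any one of the suffixes matches
        if "@" in email and not email.endswith(BLOCKED_SUFFIXES):
            cleaned.append(email)
    return cleaned

# Made-up strings standing in for the text nodes of the parsed HTML.
candidates = [
    "alice@example.com",
    "--bob@example.biz--",
    "sprite@2x.jpg",
    "Carol@example.us",
]

print(clean_candidates(candidates))
# ['alice@example.com', 'bob@example.biz', 'carol@example.us']
```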
Upvotes: 0
Reputation: 67988
import re
x="""<html>
<body>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
<p>[email protected]</p>
</body>
</html>"""
print(re.findall(r"-*([\w\-\.]{1,100}@(?:\w[\w\-]+\.)+(?!jpg)[\w]+)-*", x))
Output: ['[email protected]', '[email protected]', '[email protected]']
Upvotes: 1