Reputation: 83
I'm new to Python 2.7. Using regular expressions, I'm trying to extract just the email addresses from the lines of a text file. I'm using a non-greedy match because each email address appears twice on the same line. Here is my code:
import re
f_hand = open('mail.txt')
for line in f_hand:
    line = line.rstrip()
    if re.findall('\S+@\S+?', line): print re.findall('\S+@\S+?', line)
However, this is what I'm getting instead of just the email address:
['href="mailto:[email protected]">sercetary@a']
What pattern should I use in re.findall to get just the email address out?
Upvotes: 0
Views: 259
Reputation: 41
\S accepts many characters that aren't valid in an email address. Try a regular expression such as
[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+
(presuming you are not trying to support Unicode, which seems to be the case since your input is a "text file"). Write it as a raw string, r'...', so that \. reaches the regex engine unchanged.
This requires a "." in the server portion of the email address, and the match stops at the first character that is not valid within an email address.
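As a minimal sketch of the pattern above in use: the sample line and the address secretary@example.com are hypothetical stand-ins (the asker's real data is not shown), and extract_emails is a helper name invented for illustration. The hyphen sits at the end of each character class so that it cannot be read as a range.

```python
import re

# Character-class pattern from this answer; the hyphen is last in each class
# so it is always a literal hyphen rather than part of a range.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z0-9_.-]+')

def extract_emails(line):
    """Return every substring of `line` that looks like an email address."""
    return EMAIL_RE.findall(line)

# Hypothetical input line: the same address appears in the href and in the link text.
sample = '<a href="mailto:secretary@example.com">secretary@example.com</a>'
print(extract_emails(sample))
```

Because the address occurs twice on the line, the list contains it twice; wrapping the result in set() would drop the duplicate if that matters.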
Upvotes: 1
Reputation: 1108
The format of an email address is defined in https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1.
With that in mind, a practical regex for common addresses is r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)". (It works without depending on the text surrounding an email address.)
The following lines of code:
import re
html_str = r'<a href="mailto:[email protected]">[email protected]</a>'
email_regex = r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)"
print re.findall(email_regex, html_str)
yield:
['[email protected]', '[email protected]']
P.S. I found this regex by searching for "email address regex" and opening the first result, http://emailregex.com/
Upvotes: 0
Reputation: 1877
Try this:
re.findall(r'mailto:(\S+@\S+?\.\S+)"', line)
It should give you something like:
['[email protected]']
Upvotes: 1
Reputation: 89557
If you parse a simple file with anchors for email addresses and always the same syntax (like double quotes to enclose attributes), you can use:
for line in f_hand:
    print re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', line)
(re.findall returns only the capture group. \1 stands for the content of the first capture group.)
If the file is a more complicated html file, use a parser, extract the links and filter them.
Alternatively, use XPath, something like: substring-after(//a/@href[starts-with(., "mailto:")], "mailto:")
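The parser route mentioned above could look roughly like the following sketch. It uses the standard library's HTMLParser rather than any particular third-party parser, which is an assumption on my part; MailtoCollector, mailto_addresses and the sample markup are hypothetical names and data for illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7

class MailtoCollector(HTMLParser):
    """Collect the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # Keep only href attributes that are mailto links.
            if name == 'href' and value and value.startswith('mailto:'):
                self.emails.append(value[len('mailto:'):])

def mailto_addresses(html):
    """Parse `html` and return the addresses found in mailto links."""
    collector = MailtoCollector()
    collector.feed(html)
    collector.close()
    return collector.emails

# Hypothetical markup: one mailto link and one ordinary link.
sample = ('<p>Contact <a href="mailto:secretary@example.com">'
          'secretary@example.com</a> or see <a href="/about">about</a>.</p>')
print(mailto_addresses(sample))
```

Since the address is read from the href attribute only, the repeated copy in the link text is ignored and no regex has to guess where the address ends.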
Upvotes: 1
Reputation: 6173
\S means "not whitespace". " and > are not whitespace, so \S+ matches them too.
You should use mailto:([^@]+@[^"]+) as the regex (quoted form: 'mailto:([^@]+@[^"]+)'). This will put the email address in the first capture group.
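To make the difference concrete, here is a small sketch that runs the question's pattern and the anchored pattern side by side. The sample line and the address secretary@example.com are hypothetical, since the real input is not shown.

```python
import re

# Hypothetical line in the same shape as the asker's input.
sample = '<a href="mailto:secretary@example.com">secretary@example.com</a>'

# Original pattern: the leading \S+ is greedy and runs up to the last '@',
# while the lazy \S+? after it settles for a single character.
original = re.findall(r'\S+@\S+?', sample)

# Anchored pattern: start right after 'mailto:' and stop at the closing quote.
anchored = re.findall(r'mailto:([^@]+@[^"]+)', sample)

print(original)
print(anchored)
```

The first list holds one long string that starts at href= and ends one character after the second @, which mirrors the output in the question; the second list holds only the address.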
Upvotes: 1