Reputation: 197
I'm writing a Python program that uses a regex to find email addresses. The re.findall function gives unexpected output whenever I use round brackets for grouping. Can anyone point out the mistake or suggest an alternative solution?
Here are two snippets to illustrate:
pat = "[\w]+[ ]*@[ ]*[\w]+.[\w]+"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
gives the output
['[email protected]', '[email protected]']
However, if I use grouping in this regex and modify the code as
pat = "[\w]+[ ]*@[ ]*[\w]+(.[\w]+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
the output is
['.com', '.com']
To check the regex, I tried the pattern from the second example at http://regexpal.com/ with the same input string, and both email addresses were matched successfully.
Upvotes: 1
Views: 581
Reputation: 8758
You would use capturing groups if you wanted to do something like separate the user from the host:
(The hyphens in the character classes are optional; some email addresses contain them.)
pat = r'([\w\.-]+)@([\w\.-]+)'  # raw string, so the backslashes reach the regex engine unchanged
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
Output:
[('abc', 'cs.stansoft.edu.com'), ('myacc', 'gmail.com')]
To illustrate further, we could replace the host and keep the user from group 1 (\1):
emails = '[email protected] .rtrt.. [email protected] '
pat = r'([\w\.-]+)@([\w\.-]+)'
re.sub(pat, r'\[email protected]', emails)
Output:
'[email protected] .rtrt.. [email protected] '
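Since the email literals in this thread were obfuscated by the page, here is a self-contained sketch of the same substitution. The input string is reconstructed from the tuple output shown earlier in this answer, and `example.org` is a hypothetical replacement host, not the one originally used.

```python
import re

# Input reconstructed from the output shown above (an assumption).
emails = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '
pat = r'([\w.-]+)@([\w.-]+)'

# \g<1> refers to group 1 (the user part); "example.org" is a
# hypothetical host chosen only for illustration.
replaced = re.sub(pat, r'\g<1>@example.org', emails)
print(replaced)  # 'abc@example.org .rtrt.. myacc@example.org '
```

The `\g<1>` form is equivalent to `\1` here; it just avoids ambiguity if the replacement text happens to start with a digit.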
To match the whole email address instead, simply remove the parentheses from the pattern:
pat = r'[\w\.-]+@[\w\.-]+'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
Output:
['[email protected]', '[email protected]']
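If you want both the whole address and its parts, one option (not from the original answer, just a sketch) is to keep the groups and iterate with re.finditer, where `group(0)` is always the full match. The input string below is reconstructed from the tuple output shown earlier and is an assumption.

```python
import re

# Input reconstructed from the output shown above (an assumption).
text = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '
pat = re.compile(r'([\w.-]+)@([\w.-]+)')

# group(0) is the entire match; group(1) and group(2) are the
# user and host captured by the two parenthesised groups.
results = [(m.group(0), m.group(1), m.group(2)) for m in pat.finditer(text)]

for whole, user, host in results:
    print(whole, user, host)
```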
Upvotes: 1
Reputation: 102066
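In Python, re.findall returns the whole match only if the pattern contains no capturing groups. With one group it returns a list of that group's text; with several groups it returns a list of tuples. For a repeated group such as (.[\w]+)*, only the last repetition is kept, which is why you get '.com'. To get the whole match back, use a non-capturing group, (?:...). In this case: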
pat = r"[\w.]+ *@ *\w+(?:\.\w+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
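A minimal side-by-side sketch of the difference, assuming the input string implied by the outputs elsewhere in this thread (the literal addresses were obfuscated by the page, so the ones below are a reconstruction):

```python
import re

# Input reconstructed from outputs shown in the thread (an assumption).
text = 'abc@cs.stansoft.edu.com .rtrt.. myacc@gmail.com '

capturing = r"[\w.]+ *@ *\w+(\.\w+)*"        # one capturing group
non_capturing = r"[\w.]+ *@ *\w+(?:\.\w+)*"  # same pattern, group not captured

# With a capturing group, findall returns only that group's text
# (the last repetition of it).
with_group = re.findall(capturing, text)
# With (?:...), findall returns the whole match.
without_group = re.findall(non_capturing, text)

print(with_group)     # ['.com', '.com']
print(without_group)  # ['abc@cs.stansoft.edu.com', 'myacc@gmail.com']
```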
Upvotes: 3