anu.agg

Reputation: 197

re.findall failing for regex with grouping in Python

I'm writing a Python program that uses a regex to find email addresses. The re.findall function gives the wrong output whenever I use round brackets for grouping. Can anyone point out the mistake or suggest an alternative solution?

Here are two snippets of code that show the problem:

pat = "[\w]+[ ]*@[ ]*[\w]+.[\w]+"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

gives the output

['[email protected]', '[email protected]']

However, if I use grouping in this regex and modify the code as

pat = "[\w]+[ ]*@[ ]*[\w]+(.[\w]+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

the output is

['.com', '.com']

To confirm that the regex is correct, I tried the regex from the second example at http://regexpal.com/ with the same input string, and both email addresses were matched successfully.

Upvotes: 1

Views: 581

Answers (2)

Honest Abe

Reputation: 8758

You would use groups if you wanted to do something like separate the user from the host.
(The hyphens in the character classes are optional; they are there because some email addresses contain them.)

pat = r'([\w\.-]+)@([\w\.-]+)'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

Output:

[('abc', 'cs.stansoft.edu.com'), ('myacc', 'gmail.com')]

To illustrate further, we could replace the host and keep the user by referring to group 1 with the backreference \1:

emails = '[email protected] .rtrt.. [email protected] '
pat = r'([\w\.-]+)@([\w\.-]+)'
re.sub(pat, r'\[email protected]', emails)

Output:

'[email protected] .rtrt.. [email protected] '
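The same substitution idea can be sketched in a self-contained form. The sample addresses and the replacement host example.org below are made up for illustration and are not the values from the original post:

```python
import re

# Hypothetical sample text; these addresses are assumptions for illustration.
emails = 'alice@cs.example.edu .rtrt.. bob@example.com '

# Group 1 captures the user, group 2 captures the host.
pat = r'([\w.-]+)@([\w.-]+)'

# \1 in the replacement re-inserts whatever group 1 matched (the user),
# while the host is replaced by a fixed, hypothetical one.
replaced = re.sub(pat, r'\1@example.org', emails)

print(replaced)
```

The replacement string is a raw string so that \1 reaches the regex engine as a backreference instead of being read as an escape sequence by Python.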

To have re.findall return the whole email instead of the groups, simply remove the parentheses from the pattern:

pat = r'[\w\.-]+@[\w\.-]+'
re.findall(pat, '[email protected] .rtrt.. [email protected] ')

Output:

['[email protected]', '[email protected]']
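If both the whole address and its parts are needed at once, one possible alternative is re.finditer, which yields match objects instead of strings or tuples. This is only a sketch: the sample addresses and the group names user and host are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text; these addresses are assumptions for illustration.
text = 'alice@cs.example.edu .rtrt.. bob@example.com '

# Named groups (user, host) are hypothetical labels, not part of the original pattern.
pat = re.compile(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)')

# group(0) is always the whole match, regardless of how many groups exist.
results = [
    (m.group(0), m.group('user'), m.group('host'))
    for m in pat.finditer(text)
]

for whole, user, host in results:
    print(whole, user, host)
```

Because each match object exposes group(0), the grouping no longer affects whether the full address can be retrieved.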

Upvotes: 1

huon

Reputation: 102066

In Python, re.findall returns the whole match only if the pattern contains no capturing groups. If there is one group, it returns what that group matched; if there are several, it returns a tuple of the groups for each match. To get around this, use a non-capturing group, (?:...). In this case:

pat = r"[\w.]+ *@ *\w+(?:\.\w+)*"
re.findall(pat, '[email protected] .rtrt.. [email protected] ')
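To make the difference concrete, here is a small side-by-side sketch of the capturing and non-capturing versions of this pattern. The sample addresses are made up for illustration and are not the ones from the question:

```python
import re

# Hypothetical sample text; these addresses are assumptions for illustration.
text = 'alice@cs.example.edu .rtrt.. bob@example.com '

# Same pattern twice: once with a capturing group, once with a non-capturing one.
capturing = r"[\w.]+ *@ *\w+(\.\w+)*"
non_capturing = r"[\w.]+ *@ *\w+(?:\.\w+)*"

# With a capturing group, findall returns only the group. Since the group
# repeats, only its last repetition is kept.
with_group = re.findall(capturing, text)

# With a non-capturing group, findall returns the whole match.
without_group = re.findall(non_capturing, text)

print(with_group)     # ['.edu', '.com']
print(without_group)  # ['alice@cs.example.edu', 'bob@example.com']
```

This also explains the '.com' results in the question: the repeated group keeps only the final dotted segment of each address.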

Upvotes: 3
