Reputation: 13
I want to return all recipients of an email using regex. For example:
Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Subject: FW: If Santa Answered his mail...
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Donald W Herrick
X-To: [email protected], [email protected], [email protected], Kristi Demaiolo, Suresh Raghavan, Harry Arora
X-cc:
X-bcc:
Should return (from the "To: " line) [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
but not (from the "X-To: " line) [email protected], [email protected], [email protected] .
My current regex is re.findall(r'[To:\s][\w\.-]+@[\w\.-]+', text)
which returns everything from the "To: ", "X-To: " and "From: " lines.
My questions:
[To:\s] part of the regex?!
Upvotes: 1
Views: 130
Reputation: 41838
The regex module: infinite lookbehind and other features
If you want to use regex, I suggest you use the outstanding regex module instead of re. This regex will return all matches:
(?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)[^@\s]+@[^\,\s]+
Sample Code
I tested this in Python 3.4.
import regex
subject = """Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Subject: FW: If Santa Answered his mail...
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Donald W Herrick
X-To: [email protected], [email protected], [email protected], Kristi Demaiolo, Suresh Raghavan, Harry Arora
X-cc:
X-bcc: """
# Raw string so that \S and \s reach the regex engine unchanged
pattern = r"(?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)[^@\s]+@[^\,\s]+"
for match in regex.finditer(pattern, subject):
    print(match.group())
Output
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Explanation
The final [^@\s]+@[^\,\s]+ is a crude email matcher: it matches any chars that are not an at sign (@) or whitespace char, then an at sign, then any chars that are not a comma or whitespace char (the end-of-email delimiters in your input).
Everything inside the lookbehind (?<= ... ) must precede that email.
(?<!X\S+)To:\s* matches To: as long as it is not preceded by X-something, as asserted by the negative lookbehind (?<!X\S+).
(?:[^@\s]+@[^\,\s]+,\s*)*? matches as few repetitions as needed (the lazy *? quantifier) of the expression [^@\s]+@[^\,\s]+,\s* to allow what follows the lookbehind to match. This is an "email skipper" that lets us gradually skip more and more emails with every match.
[^@\s]+@[^\,\s]+,\s* is simply a crude email followed by a comma and optional whitespace chars (\s matches not only spaces but also carriage returns, tabs, etc.).
Upvotes: 0
Reputation: 1121524
You've misunderstood what a character class does; your pattern matches an address wherever it is directly preceded by a T, o, : or whitespace character.
That's because [To:\s] defines a character class: any single character in the set will match. This is why your From: line matches; the space between : and d suffices here.
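To see the difference between a character class and a literal prefix, here is a small sketch. The sample text and the example.com addresses are made up for illustration; they are not the addresses from the question.

```python
import re

# Hypothetical header lines; the addresses are placeholders.
sample = (
    "From: alice@example.com\n"
    "X-To: bob@example.com\n"
    "To: carol@example.com"
)

# [To:\s] is a set: it consumes exactly ONE character out of T, o, : or whitespace.
# The space in front of every address satisfies it, so all three lines match.
char_class = re.findall(r"[To:\s][\w.-]+@[\w.-]+", sample)

# Without the brackets, To: is a literal three-character sequence.
# From: no longer matches, but X-To: still does because nothing anchors the match.
literal = re.findall(r"To:\s[\w.-]+@[\w.-]+", sample)

print(char_class)
print(literal)
```

The second result still includes the X-To: line, which is why the literal prefix alone is not enough.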
If you need to validate the whole header name, anchor your match to the start of lines with ^, and drop the character class:
r'^To:\s+[\w\.-]+@[\w\.-]+'
Now the To: part only matches at the start of a line, provided you use the re.MULTILINE flag:
>>> import re
>>> text = '''\
... Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
... From: [email protected]
... To: [email protected], [email protected],
... [email protected], [email protected],
... [email protected], [email protected]
... Subject: FW: If Santa Answered his mail...
... Mime-Version: 1.0
... Content-Type: text/plain; charset=us-ascii
... Content-Transfer-Encoding: 7bit
... X-From: Donald W Herrick
... X-To: [email protected], [email protected], [email protected], Kristi Demaiolo, Suresh Raghavan, Harry Arora
... X-cc:
... X-bcc:
... '''
>>> re.findall(r'^To:\s+[\w\.-]+@[\w\.-]+', text)
[]
>>> re.findall(r'^To:\s+[\w\.-]+@[\w\.-]+', text, flags=re.M)
['To: [email protected]']
This can only ever match the first email address, and only if it doesn't include anything like a full name (Brian Herrick <[email protected]>, for example).
You'd have to match the whole header instead:
re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M)
This matches the To: header followed by any number of header continuation lines (starting with whitespace):
>>> re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M)
['[email protected], [email protected], \n [email protected], [email protected], \n [email protected], [email protected]']
and you'd have to extract email addresses separately from that.
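One way that second step might look, as a rough sketch: capture the folded header first, then run a simple address pattern over the captured value. The message text and the example.com addresses below are invented for illustration, and the second pattern is the crude matcher from the question, not a full address grammar.

```python
import re

# Hypothetical message headers; the To: header is folded onto a second line.
text = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com,\n"
    "\tdave@example.com\n"
    "Subject: hello\n"
    "X-To: erin@example.com\n"
)

# Step 1: capture the To: header value, including continuation lines
# (lines that start with a space or tab).
header_values = re.findall(r"^To:\s+((?:.*(?:\n[ \t]+)?)*)", text, flags=re.M)

# Step 2: pull the bare addresses out of the captured value.
addresses = re.findall(r"[\w.-]+@[\w.-]+", header_values[0])

print(addresses)
```

Because the first pattern is anchored with ^ and re.M, the X-To: line never reaches the second step.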
Personally, I'd be looking into the email package instead; it makes it much easier to grab headers:
import email
import email.utils  # imported explicitly rather than relying on it being loaded as a side effect
message = email.message_from_string(text)
to_headers = message.get_all('to')
addresses = email.utils.getaddresses(to_headers)
Demo:
>>> import email
>>> m = email.message_from_string(text)
>>> m.get_all('to')
['[email protected], [email protected], \n [email protected], [email protected], \n [email protected], [email protected]']
>>> email.utils.getaddresses(m.get_all('to'))
[('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]')]
Now you have all the email addresses.
The email.utils.getaddresses() function can also be applied when using the regular expression:
>>> email.utils.getaddresses(re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M))
[('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]')]
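The email package also copes with the full-name form mentioned earlier, which the simple regex cannot. A minimal sketch, assuming a made-up message with placeholder example.com addresses:

```python
import email
import email.utils

# Hypothetical message: one recipient has a display name, and the To: header is folded.
raw = (
    "From: alice@example.com\n"
    "To: Bob Example <bob@example.com>, carol@example.com,\n"
    " dave@example.com\n"
    "X-To: erin@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

message = email.message_from_string(raw)

# get_all() matches the header name exactly (case-insensitively), so X-To is ignored.
# The [] default avoids passing None when the header is absent.
pairs = email.utils.getaddresses(message.get_all("to", []))

# Each pair is (display name, address); keep only the addresses.
addresses = [addr for _name, addr in pairs]

print(pairs)
print(addresses)
```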
Upvotes: 0
Reputation: 63717
As an addendum to @MartijnPieters's answer, regex may not be the right tool for the job. To parse an email message, it is recommended to use email.parser:
>>> import pprint
>>> from email.parser import Parser
>>> headers = Parser().parsestr(email_str)
>>> pprint.pprint(list(map(str.strip, headers['to'].split())))
['[email protected],',
'[email protected],',
'[email protected],',
'[email protected],',
'[email protected],',
'[email protected]']
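Splitting on whitespace leaves the trailing commas attached, as the output above shows. A small variation that splits on commas instead might look like this; the message text and addresses are placeholders, and this only works when no display name contains a comma.

```python
from email.parser import Parser

# Hypothetical headers with a folded To: line; addresses are placeholders.
raw = (
    "To: bob@example.com, carol@example.com,\n"
    " dave@example.com\n"
    "X-To: erin@example.com\n"
    "\n"
)

# headersonly=True stops parsing after the header block.
headers = Parser().parsestr(raw, headersonly=True)

# Split on commas, then strip the surrounding whitespace and fold newlines.
recipients = [part.strip() for part in headers["to"].split(",") if part.strip()]

print(recipients)
```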
Upvotes: 2