user3797035

Reputation: 13

Ignore strings which do not completely match regex?

I want to return all recipients of an email using regex. For example:

Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected], 
    [email protected], [email protected], 
    [email protected], [email protected]
Subject: FW: If Santa Answered his mail...
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Donald W Herrick
X-To: [email protected], [email protected], [email protected], Kristi Demaiolo, Suresh Raghavan, Harry Arora
X-cc: 
X-bcc: 

Should return (from the "To: " line) [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

but not (from the "X-To: " line) [email protected], [email protected], [email protected] .

My current regex is re.findall(r'[To:\s][\w\.-]+@[\w\.-]+', text), which returns every address from the "To: ", "X-To: " and "From: " lines.

My questions:

  1. Why is the email address on the "From: " line also returned? It doesn't match the [To:\s] part of the regex?!
  2. How can I ensure that only email addresses which follow "To: " are returned? (That is, how do I exclude email addresses following "X-To: "?) I think lookahead assertions can be used for this, but I am not sure how.

Upvotes: 1

Views: 130

Answers (3)

zx81

Reputation: 41838

regex module: infinite lookbehind and other features

If you want to use regex, I suggest you use the outstanding regex module instead of re. This regex will return all matches:

(?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)[^@\s]+@[^\,\s]+

Sample Code

I tested this in Python 3.4.

import regex
subject = """Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected], 
    [email protected], [email protected], 
    [email protected], [email protected]
Subject: FW: If Santa Answered his mail...
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Donald W Herrick
X-To: [email protected], [email protected], [email protected], Kristi Demaiolo, Suresh Raghavan, Harry Arora
X-cc: 
X-bcc: """
# Raw string, so the backslashes reach the regex engine unchanged
pattern = r"(?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)[^@\s]+@[^\,\s]+"

for match in regex.finditer(pattern, subject):
    print(match.group())

Output

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Explanation

  • We have one big lookbehind, then a very basic email matcher: [^@\s]+@[^\,\s]+ which matches any chars that are not an at sign (@) or whitespace char, then an at sign, then any chars that are not a comma or whitespace char (the end-of-email delimiters in your input)
  • That email matcher can be replaced by a more sophisticated email regex if need be
  • Now to the big lookbehind (?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)
  • The first part (?<!X\S+)To:\s* matches To: as long as it is not preceded by an X followed by non-whitespace chars (as in X-To:), as asserted by the negative lookbehind (?<!X\S+)
  • The non-capturing group (?:[^@\s]+@[^\,\s]+,\s*)*? matches as few repetitions as needed (the *? quantifier) of the expression [^@\s]+@[^\,\s]+,\s* to allow what follows the lookbehind to match. This is an "email skipper" that lets us gradually skip more and more emails with every match
  • [^@\s]+@[^\,\s]+,\s* is simply a crude email followed by a comma and optional whitespace chars (the \s matches not only spaces but also carriage returns, tabs etc.)
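
The regex module is needed here because the standard-library re module rejects lookbehinds whose width can vary, such as one containing \s*. A minimal sketch that checks this; compiles_with_re is a hypothetical helper name, and the two patterns are simplified stand-ins rather than the full pattern above:

```python
import re


def compiles_with_re(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Fixed-width lookbehind: always exactly four characters ("To: ").
fixed = compiles_with_re(r'(?<=To: )\w+')

# Variable-width lookbehind: \s* can match zero or more characters.
variable = compiles_with_re(r'(?<=To:\s*)\w+')

print(fixed, variable)
```

The second pattern is refused by re, which is why the pattern in this answer has to be run with regex.finditer rather than re.finditer.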

Upvotes: 0

Martijn Pieters

Reputation: 1121524

You've misunderstood what a character class does; your pattern matches any email address that is directly preceded by a T, o, : or whitespace character.

That's because [To:\s] is a character class: any one character in the set will match. This is why your From: line matches; the space after the colon suffices here.
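
To see this in isolation, here is a small sketch with made-up headers (the example.com addresses are placeholders, not the ones from the question). Every match starts with the single space that the character class consumed:

```python
import re

# Hypothetical sample headers; the addresses are placeholders.
text = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "X-To: dave@example.com\n"
)

# [To:\s] consumes exactly ONE character: 'T', 'o', ':' or whitespace.
# It does not mean "the literal text 'To:' followed by whitespace".
matches = re.findall(r'[To:\s][\w\.-]+@[\w\.-]+', text)
print(matches)
# [' alice@example.com', ' bob@example.com', ' carol@example.com', ' dave@example.com']
```

The header name plays no part in the match, so From:, To: and X-To: lines are all treated alike.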

If you need to validate the whole header name, drop the character class, spell out To: literally, and anchor the match to the start of a line with ^:

r'^To:\s+[\w\.-]+@[\w\.-]+'

Now the To: part only matches if at the start of a line, provided you use the re.MULTILINE flag:

>>> import re
>>> text = '''\
... Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
... From: [email protected]
... To: [email protected], [email protected], 
...     [email protected], [email protected], 
...     [email protected], [email protected]
... Subject: FW: If Santa Answered his mail...
... Mime-Version: 1.0
... Content-Type: text/plain; charset=us-ascii
... Content-Transfer-Encoding: 7bit
... X-From: Donald W Herrick
... X-To: [email protected], [email protected], [email protected], Kristi Demaiolo, Suresh Raghavan, Harry Arora
... X-cc: 
... X-bcc: 
... '''
>>> re.findall(r'^To:\s+[\w\.-]+@[\w\.-]+', text)
[]
>>> re.findall(r'^To:\s+[\w\.-]+@[\w\.-]+', text, flags=re.M)
['To: [email protected]']

This can only ever match the first email address, and only if it doesn't include anything like a full name (Brian Herrick <[email protected]>, for example).

You'd have to match the whole header instead:

re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M)

This matches the To: header followed by any number of header continuation lines (starting with whitespace):

>>> re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M)
['[email protected], [email protected], \n    [email protected], [email protected], \n    [email protected], [email protected]']

and you'd have to extract email addresses separately from that.
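
A minimal sketch of that two-step approach, assuming plain addresses without display names. The names HEADER_RE, ADDRESS_RE and to_addresses are made up for illustration, the header pattern is a simplified variant of the one above, and the sample addresses are placeholders:

```python
import re

# Step 1 pattern: the To: header value plus any continuation lines
# (lines that start with a space or tab).
HEADER_RE = re.compile(r'^To:[ \t]*(.*(?:\n[ \t]+.*)*)', re.MULTILINE)

# Step 2 pattern: the same crude address matcher used in the question.
ADDRESS_RE = re.compile(r'[\w.-]+@[\w.-]+')


def to_addresses(message_text):
    """Return the addresses found in the To: header, or [] if there is none."""
    match = HEADER_RE.search(message_text)
    if match is None:
        return []
    return ADDRESS_RE.findall(match.group(1))


# Hypothetical sample message; the addresses are placeholders.
sample = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com,\n"
    "    dave@example.com\n"
    "Subject: hello\n"
    "X-To: erin@example.com\n"
)

print(to_addresses(sample))
# ['bob@example.com', 'carol@example.com', 'dave@example.com']
```

Because ^ only matches at the start of a line, the X-To: line is never considered.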

Personally, I'd look into the email package instead; it makes it much easier to grab headers:

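import email
import email.utils  # import explicitly rather than relying on it being loaded as a side effect

# Parse the raw message text, then collect every To: header value
message = email.message_from_string(text)
to_headers = message.get_all('to')
# Split the header values into (display name, address) pairs
addresses = email.utils.getaddresses(to_headers)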

Demo:

>>> import email
>>> m = email.message_from_string(text)
>>> m.get_all('to')
['[email protected], [email protected], \n    [email protected], [email protected], \n    [email protected], [email protected]']
>>> email.utils.getaddresses(m.get_all('to'))
[('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]')]

Now you have all the email addresses.

The email.utils.getaddresses() function can also be applied to the result of the regular expression:

>>> email.utils.getaddresses(re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M))
[('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]'), ('', '[email protected]')]

Upvotes: 0

Abhijit

Reputation: 63717

As an addendum to @MartijnPieters's answer: regex may not be the right tool for the job. To parse an email message, it is recommended to use email.parser:

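>>> import pprint
>>> from email.parser import Parser
>>> headers = Parser().parsestr(email_str)  # email_str holds the raw message text
>>> pprint.pprint(list(map(str.strip, headers['to'].split())))  # list() so Python 3 prints the items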
['[email protected],',
 '[email protected],',
 '[email protected],',
 '[email protected],',
 '[email protected],',
 '[email protected]']
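
The whitespace split above leaves the trailing commas attached. A small variation, sketched here with a made-up message (placeholder addresses), splits on commas instead. It assumes no display name contains a comma; if one might, email.utils.getaddresses is the safer route:

```python
from email.parser import Parser

# Hypothetical sample message; the addresses are placeholders.
email_str = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com,\n"
    "    dave@example.com\n"
    "Subject: hello\n"
    "\n"
    "Body text\n"
)

# headersonly=True stops parsing at the end of the header block.
headers = Parser().parsestr(email_str, headersonly=True)

# Split the To: value on commas and strip the surrounding whitespace,
# including the newline and indentation of folded continuation lines.
recipients = [part.strip() for part in headers['to'].split(',') if part.strip()]
print(recipients)
# ['bob@example.com', 'carol@example.com', 'dave@example.com']
```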

Upvotes: 2
