
Python regex: capturing group captures/overrides subsequent matches

In a regex, how can I match any number of any character (e.g., (.|\n)*) without consuming other matches that could follow? If that question isn't clear, here is my situation:

In a text file, I have a bunch of emails including headers all pasted together.

Edit: The cleaner version below has each header at the start of a newline. That may or may not be the case with my actual data. Each header component (like 'From: xxx') may be preceded by anything or nothing. In some instances, many emails and headers could all be on one line, after a bunch of other cruft. On top of that, there are other email headers I'll need to recognize that include 'From:' in them. So, I need to recognize this entire header style.

Several answers below given before my edit rely on things like ^ or tab separation, which I can't count on. They seem like they might work with a little modification, but I'm (obviously) not great with regex and I've been unable to adjust them myself. I'm sorry for omitting this before, only for several answerers to seize on it... another product of my inexperience with regexes.

Here is an ugly version; this is a string I'm actually trying to match. It contains two headers and two messages to pull out.

emailsString = u"""From:\n     Lastname, Firstname\n     Sent:\n     Monday, June 24, 2013 1:48 PM\n     To:\n     Othername, Name\n     Subject:\n     RE: Center update\n    Message message message.\n    Such a lovely message\n    Take care,\n    Firstname Lastname, MS\n     Long signature\n     in this email\n   \n    E-mail:\n     [email protected]\n     Web\n     my blog\n     From:\n     Lastname, Firstname\n     Sent:\n     Monday, June 24, 2013 9:33 AM\n     To:\n     Othername, Name\n     Subject:\n     Center update\n     Importance:\n     High\n    Good Morning Name,\n    I hope this finds you doing well.\n    I wanted to inform you of some changes. The Center will be closing August 30\n     th\n     .  or September 1\n     st\n     .  I\u2019ve enjoyed my experience. """

Here is a cleaner version to show what the headers look like

From: Lastname, Firstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: Othername, Name
Subject: blah
Importance: High

Message message message
second line of message

second para of message

From: Lastname, Firstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: Othername, Name
Subject: blahblah

message

...

I'm trying to regex out the information in the headers along with the message itself. I have a regex that can successfully match all the headers, but I'm struggling with the message. The problem is that a message can contain anything (or nothing). There could be multiple newlines, etc. I want to get all of this, but I still want to split up the emails. My attempt (note that the 'Importance' part of the header is optional):

for hit in re.finditer(r'[\s\n]*From:[\s\n]*(?P<from>.*)[\s\n]*Sent:[\s\n]*(?P<date>.*)[\s\n]*To:[\s\n]*(?P<to>.*)[\s\n]*Subject:[\s\n]*(?P<subject>.*)[\s\n]*(?:Importance:)?[\s\n]*.*[\s\n]*(?P<message>(.|\n)*)', allEmailsString):
    print("from: " + hit.group("from"))
    print("to: " + hit.group("to"))
    print("date: " + hit.group("date"))
    print("subject: " + hit.group("subject"))
    print("message: " + hit.group("message"))

The problem is that the message group grabs everything. I get the first email header's from/to/etc. correctly, but the message then contains that email's message along with all of the following email headers and messages. I need to grab 'everything up until the next email header/regex match or until the end of the string'.

I've already got a workaround - I can get rid of the message capturing group and grab only the headers. Then, iterate through the match objects and slice the string based on their start/end. E.g., message1 is from match1.end up to match2.start.
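A minimal sketch of that workaround, assuming a simplified header-only pattern with lazy captures. The names header and split_emails are made up for illustration and the pattern is not the exact regex from my attempt above:

```python
import re

# Header-only pattern (an assumption: simplified, no message group).
header = re.compile(
    r'From:\s*(?P<from>.*?)\s*'
    r'Sent:\s*(?P<date>.*?)\s*'
    r'To:\s*(?P<to>.*?)\s*'
    r'Subject:\s*(?P<subject>[^\n]*)'
    r'(?:\s*Importance:\s*(?P<importance>\S+))?',
    re.DOTALL)

def split_emails(text):
    """Return one dict per email; the message is sliced out between headers."""
    hits = list(header.finditer(text))
    emails = []
    for i, hit in enumerate(hits):
        # The message runs from the end of this header to the start of the
        # next header, or to the end of the string for the last email.
        end = hits[i + 1].start() if i + 1 < len(hits) else len(text)
        fields = hit.groupdict()
        fields['message'] = text[hit.end():end].strip()
        emails.append(fields)
    return emails

sample = ("From: A, B\nSent: Monday\nTo: C\nSubject: one\nImportance: High\n\n"
          "Hello\nthere\n\n"
          "From: D, E\nSent: Tuesday\nTo: F\nSubject: two\n\nBye")
emails = split_emails(sample)
```

The slicing keeps the regex small, at the cost of a second pass over the match objects.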

So, I'm asking...

Is there a way I can do this with capturing groups in my regex instead?
Is there a better workaround?

Upvotes: 0

Answers (3)

user557597


This might be painful to look at. It's expanded for clarity.
Use multi-line mode, without dot-all.

@mobabo - Edit to this after your first comment.

There must be a clear delineation of your keywords, and there is. Your statement that
you can't count on things like '^From' to work shows you didn't look at the previous
regex, and that part is the same in this one. ^[^\S\n]*From: is not the same as ^From:
it also allows any whitespace other than a newline between the start of the line and From:
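To make the difference concrete, here is a small sketch in Python (the sample text is invented for illustration). In multi-line mode, ^From: only finds a header that sits flush against the start of a line, while ^[^\S\n]*From: also finds one that is indented:

```python
import re

text = "From: first\n     From: second\n"

# Strict: the keyword must be the very first thing on the line.
strict = re.findall(r'^From:', text, re.MULTILINE)

# Lenient: [^\S\n]* allows spaces/tabs (whitespace except newline) first.
lenient = re.findall(r'^[^\S\n]*From:', text, re.MULTILINE)

print(len(strict), len(lenient))
```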

Additionally, there is no clear delineation between Subject and Message
or between Importance and Message. If 'Importance' is part of the email, the Subject has an end point;
the message itself is captured together with whichever header comes last.

I've made a regex that handles your dirty and clean emails. At the bottom is a Perl
program that exercises it, with its output included. See if that can solve your issues
(see below).

Unfortunately, this is the best you can hope for.

Good Luck Sir!
(A note: if Python's regex engine had recursion, this regex would be a quarter of this size.)

 # Compressed
 # -------------------
 #  ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?

 # Expanded
 # -------------------
 #

 ^ [^\S\n]* From: \s* 
 (?P<from>
      (?:
           (?!
                \s* ^ [^\S\n]* 
                (?: From | Sent | To | Subject | Importance )
                :
           )
           [\S\s] 
      )*
 )

 (?:
      \s* ^ [^\S\n]* Sent: \s* 
      (?P<sent>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?: From | Sent | To | Subject | Importance )
                     :
                )
                [\S\s] 
           )*
      )
 )?

 (?:
      \s* ^ [^\S\n]* To: \s* 
      (?P<to>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?: From | Sent | To | Subject | Importance )
                     :
                )
                [\S\s] 
           )*
      )
 )?

 (?:
      \s* ^ [^\S\n]* Subject: \s* 
      (?P<subject>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?:
                          (?: From | Sent | To | Subject | Importance )
                     )
                     :
                )
                [\S\s] 
           )*
      )

      (?:
           \s* ^ [^\S\n]* Importance: \s* 
           (?P<importance>
                (?:
                     (?!
                          \s* ^ [^\S\n]* 
                          (?: From | Sent | To | Subject | Importance )
                          :
                     )
                     [\S\s] 
                )*
           )
      )?
 )?


 # // Output from Perl sample code (below)
 # //
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, July 15th, 2011, 9:36 AM
 # // To:
 # //         Othername, Name
 # // Subject:
 # //         blah
 # // Importance/Message:
 # //         High
 # // 
 # // Message message message
 # // second line of message
 # // 
 # // second para of message
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Thursday, July 18th, 2011, 10:45 AM
 # // To:
 # //         Othername, Name
 # // Subject/Message:
 # //         blahblah
 # // 
 # // message
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, June 24, 2013 1:48 PM
 # // To:
 # //         Othername, Name
 # // Subject/Message:
 # //         RE: Center update
 # //     Message message message.
 # //     Such a lovely message
 # //     Take care,
 # //     Firstname Lastname, MS
 # //      Long signature
 # //      in this email
 # // 
 # //     E-mail:
 # //      [email protected]
 # //      Web
 # //      my blog
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, June 24, 2013 9:33 AM
 # // To:
 # //         Othername, Name
 # // Subject:
 # //         Center update
 # // Importance/Message:
 # //         High
 # //     Good Morning Name,
 # //     I hope this finds you doing well.
 # //     I wanted to inform you of some changes. The Center will be closing August 30
 # // 
 # //      th
 # //      .  or September 1
 # //      st
 # //      .  I've enjoyed my experience.
 # // 

 # ------------------------------------------------------------
 # # Perl sample code
 # use strict;
 # use warnings;
 # 
 # $/ = undef;
 # 
 # my $str = <DATA>;
 # 
 # 
 # 
 # while ( $str =~ /
 #     ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
 # /xmg)
 # 
 # {
 #  print "\n\n======================\n";
 #  print "From: \n\t$+{from}\n";
 #  if (defined $+{sent})
 #  {
 #      print "Sent: \n\t$+{sent}\n";
 #  }
 #  if (defined $+{to})
 #  {
 #      print "To: \n\t$+{to}\n";
 #  }
 #  if (defined $+{importance})
 #  {
 #      print "Subject: \n\t$+{subject}\n";
 #      print "Importance/Message: \n\t$+{importance}\n";
 #  }
 #  elsif (defined $+{subject})
 #  {
 #      print "Subject/Message: \n\t$+{subject}\n";
 #  }
 # }
 # 
 # 
 # __DATA__
 # 
 # From: Lastname, Firstname
 # Sent: Monday, July 15th, 2011, 9:36 AM
 # To: Othername, Name
 # Subject: blah
 # Importance: High
 # 
 # Message message message
 # second line of message
 # 
 # second para of message
 # 
 # From: Lastname, Firstname
 # Sent: Thursday, July 18th, 2011, 10:45 AM
 # To: Othername, Name
 # Subject: blahblah
 # 
 # message
 # 
 # 
 # 
 # 
 # 
 # From:
 #      Lastname, Firstname
 #      Sent:
 #      Monday, June 24, 2013 1:48 PM
 #      To:
 #      Othername, Name
 #      Subject:
 #      RE: Center update
 #     Message message message.
 #     Such a lovely message
 #     Take care,
 #     Firstname Lastname, MS
 #      Long signature
 #      in this email
 #    
 #     E-mail:
 #      [email protected]
 #      Web
 #      my blog
 #      From:
 #      Lastname, Firstname
 #      Sent:
 #      Monday, June 24, 2013 9:33 AM
 #      To:
 #      Othername, Name
 #      Subject:
 #      Center update
 #      Importance:
 #      High
 #     Good Morning Name,
 #     I hope this finds you doing well.
 #     I wanted to inform you of some changes. The Center will be closing August 30
 #      th
 #      .  or September 1
 #      st
 #      .  I've enjoyed my experience.
 # 
 #
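
For anyone who wants to try the same regex from Python rather than Perl, here is a rough sketch. It assembles the pattern from two small helpers so the repeated lookahead is written only once; STOP, field and header are names I made up for this illustration, and the sample text is invented. As in the Perl output, the message ends up attached to the last header that was found:

```python
import re

# Negative lookahead: "what follows is NOT the start of a header line".
STOP = r'(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)'

def field(name):
    # Named group that consumes one character at a time while STOP allows it.
    return r'(?P<%s>(?:%s[\S\s])*)' % (name, STOP)

def header(keyword):
    # A keyword at the start of a line, optionally indented.
    return r'\s*^[^\S\n]*%s:\s*' % keyword

pattern = re.compile(
    r'^[^\S\n]*From:\s*' + field('from')
    + '(?:' + header('Sent') + field('sent') + ')?'
    + '(?:' + header('To') + field('to') + ')?'
    + '(?:' + header('Subject') + field('subject')
    + '(?:' + header('Importance') + field('importance') + ')?)?',
    re.MULTILINE)  # multi-line mode, no DOTALL

sample = ("From: A, B\nSent: Monday\nTo: C\nSubject: blah\nImportance: High\n\n"
          "Message one\n\n"
          "     From:\n     D, E\n     Sent:\n     Tuesday\n     To:\n     F\n"
          "     Subject:\n     second\n    message two")

results = [m.groupdict() for m in pattern.finditer(sample)]
for r in results:
    print(r)
```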

Upvotes: 0

eyquem

Reputation: 27575

A regex can extract chunks of a text only if the text is composed of variable parts and stable portions (or at least portions whose variability is itself stable).

In the following regex pattern, I made some assumptions about the "stable" portions in order to increase their number, which makes it possible to tell the emails apart and to extract the desired chunks from texts that seem to have few reliable anchors:

I assumed that the 'Sent' part always contains the name of a day of the week.
I assumed that if the 'Importance' line exists, a single word describes the importance, hence [^ \t\r\n]+
I assumed that the subject can't span several lines, hence [^\r\n]+

If a text has too few stable portions, that is to say its structure is too loose, using a regex becomes impossible.

The pattern [ \t\r\n]*(?P<from>.*?[^ \t\r\n])[ \t\r\n]* has a strip effect on the captured group.
As a consequence, if the message consists only of blank lines, the match reports the message as ''
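
A small sketch of the strip effect in isolation, using a cut-down pattern with only the From and Sent anchors (the sample string is invented for illustration):

```python
import re

# Whitespace before the group is eaten by the leading class; the group must
# end on a non-whitespace character, so trailing blanks stay outside it.
pat = re.compile(r'From:[ \t\r\n]*(?P<from>.*?[^ \t\r\n])[ \t\r\n]*Sent:',
                 re.DOTALL)

m = pat.search("From:\n     Lastname, Firstname   \n     Sent:")
captured = m.group('from')
print(repr(captured))
```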

The \Z is necessary to catch the last email when no other lines follow the last message, as in my text example.
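
To see why, here is a reduced sketch with a single From: anchor (both patterns and the sample text are simplified for illustration and are not the full regex below). The lazy .*? stops as soon as the lookahead succeeds, and without \Z nothing can make it succeed for the last email:

```python
import re

text = "From: a\nbody one\nFrom: b\nbody two"

# Lazy message, ended by the next header OR by the end of the string.
with_z = re.compile(r'From:[^\n]*\n(?P<message>.*?)(?=\s*From:|\Z)', re.DOTALL)

# Same pattern, but only a following header can end the message.
without_z = re.compile(r'From:[^\n]*\n(?P<message>.*?)(?=\s*From:)', re.DOTALL)

found_with = [m.group('message') for m in with_z.finditer(text)]
found_without = [m.group('message') for m in without_z.finditer(text)]
print(found_with)
print(found_without)
```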

import re


emailsString = (u'     From:\n'
                '     Lastname, Firstname\n'
                '     Sent:\n'
                '     Monday, June 24, 2013 1:48 PM\n'
                '     To:\n'
                '     Othername, Name\n'
                '     Subject:\n'
                '     RE: Center update\n'
                '    Message message message.\n'
                '    Such a lovely message\n'
                '    Take care,\n'
                '    Firstname Lastname, MS\n'
                '     Long signature\n'
                '     in this email\n'
                '   \n'
                '    E-mail:\n'
                '     [email protected]\n'
                '     Web\n'
                '     my blog\n'
                '     From:\n'
                '     Lastname, Firstname\n'
                '     Sent:\n'
                '     Monday, June 24, 2013 9:33 AM\n'
                '     To:\n'
                '     Othername, Name\n'
                '     Subject:\n'
                '     Center update\n'
                '     Importance:\n'
                '     High\n'
                '    Good Morning Name,\n'
                '    I hope this finds you doing well.\n'
                '    I wanted to inform you of some changes. The Center will be closing August 30\n'
                '     th\n'
                '     .  or September 1\n'
                '     st\n'
                '     .  I\u2019ve enjoyed my experience. ')


allEmailsString = '''
From: FirstLastname, FirstFirstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: TheOne
Subject: blah
Importance: High

Message message message
second line of message

second para of message

From: MidLastname, MidFirstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: TWOTWO
Subject: once upon



From: LastLastname, LastFirstname
Sent: Saturday, July 20th, 2011, 12:51 AM
To: Mr Three
Subject: blobloblo

Nothing to say. '''



dispat = ("*  from: {from}\n"
          "*  to: {to}\n"
          "*  date: {date}\n"
          "*  subject: {subject}\n"
          "** message (beginning on next line):\n{message}\n"
          "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-")



regx = re.compile('From:[ \t\r\n]*(?P<from>.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'Sent:[ \t\r\n]*'
                  '(?P<date>.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'To:[ \t\r\n]*(?P<to>.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'Subject:[ \t\r\n]*(?P<subject>[^\r\n]+)'
                  '[ \t\r\n]*'
                  '(?:Importance:[ \t\r\n]*(?P<importance>[^ \t\r\n]+))?'
                  '[ \t\r\n]*'
                  '(?P<message>.*?)'
                  '(?=[ \t\r\n]*From:.*?'
                  'Sent:.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?'
                  r'To.*?Subject:|\Z)',
                  re.DOTALL)


for s in (emailsString, allEmailsString):
    print(''.join(dispat.format(**d)
                  for d in (ma.groupdict('') for ma in regx.finditer(s))))
    print('\n#######################################\n')

result

*  from: Lastname, Firstname
*  to: Othername, Name
*  date: Monday, June 24, 2013 1:48 PM
*  subject: RE: Center update
** message (beginning on next line):
Message message message.
    Such a lovely message
    Take care,
    Firstname Lastname, MS
     Long signature
     in this email

    E-mail:
     [email protected]
     Web
     my blog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: Lastname, Firstname
*  to: Othername, Name
*  date: Monday, June 24, 2013 9:33 AM
*  subject: Center update
** message (beginning on next line):
Good Morning Name,
    I hope this finds you doing well.
    I wanted to inform you of some changes. The Center will be closing August 30
     th
     .  or September 1
     st
     .  I\u2019ve enjoyed my experience. 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#######################################

*  from: FirstLastname, FirstFirstname
*  to: TheOne
*  date: Monday, July 15th, 2011, 9:36 AM
*  subject: blah
** message (beginning on next line):
Message message message
second line of message

second para of message
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: MidLastname, MidFirstname
*  to: TWOTWO
*  date: Thursday, July 18th, 2011, 10:45 AM
*  subject: once upon
** message (beginning on next line):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: LastLastname, LastFirstname
*  to: Mr Three
*  date: Saturday, July 20th, 2011, 12:51 AM
*  subject: blobloblo
** message (beginning on next line):
Nothing to say. 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#######################################

Upvotes: 1

mVChr

Reputation: 50177

I'd just divide (split) and conquer (re.match):

import re

# `data` is your text file
delimiter = r'(^|\n)From:'
capturer = re.compile(r'From:[\n\s]*(?P<from>.*)[\n\s]*'
                      r'Sent:[\n\s]*(?P<date>.*)[\n\s]*'
                      r'To:[\n\s]*(?P<to>.*)[\n\s]*'
                      r'Subject:[\n\s]*(?P<subject>.*)[\n\s]*'
                      r'(?:Importance:)?[\n\s]*.*[\n\s]*'
                      r'(?P<message>(\n|.)*)')

raw_emails = ['From:' + d for d in re.split(delimiter, data) if d.strip()]
emails = []
for raw_email in raw_emails:
    parts = capturer.match(raw_email)
    emails.append(parts.groupdict())

For your example data, this outputs:

[{'date': 'Monday, July 15th, 2011, 9:36 AM',
  'from': 'Lastname, Firstname',
  'message': 'Message message message\nsecond line of message\n\nsecond para of message\n',
  'subject': 'blah',
  'to': 'Othername, Name'},
 {'date': 'Thursday, July 18th, 2011, 10:45 AM',
  'from': 'Lastname, Firstname',
  'message': '...\n',
  'subject': 'blahblah',
  'to': 'Othername, Name'}]
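
If the headers are not guaranteed to start a line, one possible variation is to split on a zero-width lookahead instead. This is only a sketch: split_on_from is a made-up name, the sample string is invented, and it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string:

```python
import re

def split_on_from(data):
    # The lookahead matches the empty string just before each "From:",
    # so the keyword stays attached to the chunk that follows it.
    chunks = re.split(r'(?=From:)', data)
    # Drop any leading cruft that does not begin with a header.
    return [c for c in chunks if c.lstrip().startswith('From:')]

data = ("some cruft From: a Sent: x To: b Subject: s1 hello "
        "From: c Sent: y To: d Subject: s2 bye")
raw_emails = split_on_from(data)
print(raw_emails)
```

Note that a literal "From:" inside a message body would also cause a split, so the delimiter may need more context (for example, a following Sent:) on real data.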

Upvotes: 0
