Regex over changing conditions in email Python

Question

I have employee-client email exchanges from which I need to pull the client message bodies, for use in later sentiment analysis.

These emails were generated using different email applications, so there's not a single regex rule I can use to separate the emails, and they don't all conform to the form used by the email module, so object-style parsing isn't a possibility. Sometimes different email applications were mixed in a single chain, so I can't regex on a specific profile either.

However, these rules have cropped up as reliable:

Email Start:

*@[not acme] wrote:
[newline]to: *@acme.com
[newline]to: *'acme support'
beginning of the chain

Email End:

*@[acme]> wrote:
[newline]From: *@acme.com
end of the chain

These can be mixed and matched during the life of a chain. A message can start with a 'wrote' rule, and end with a '[newline]From:' rule, or '*@acme> wrote' rule, etc.

Is there any elegant way to set different start and end conditions for this regex? Ideally, it would lazily stop at the first instance of hitting one of the end rules.

FWIW, I consider myself strictly intermediate with python. Experienced enough to struggle meaningfully through documentation, but not enough to play with the deeper layers of the language.

Example Source Data:

thank you john



from: noreply@acme.com [mailto:noreply@acme.com] on behalf
of acme help
sent: thursday, december 29, 2016 11:28 am
to: Jane Doe
subject: re: aha - overtime






hi Jane,



it is affected by your payroll schedule. because it is semi-monthly,
overtime is a tricky thing to calculate, so we have to make sure we do
it just right! once i turn this setting on, you will be good to go from
this point on!



best regards,



john doe

customer experience team 


[]


  



  

ref:_00d15ft7b._50015ypl8b:

Jane_Doe@her_company.com wrote:
refit will not come up, even after logging on. i have my pass word and user
    name write on a sheet of paper in my wallet, so i know it is correct. it
looks like it is trying to come up, but all i see is  two arrows going
in circles..



Jane



from: acme support [mailto:help@acme.com] 
sent: thursday, january 05, 2017 10:42 am
to: Jane Doe
subject: [graymail] re: happy new year from acme!



hello Jane,



sorry to hear that you're having trouble using acme. can you please
elaborate on the issue that you're experiencing?



best regards,



john

customer experience team 




--------------- original message ---------------
from: Jane Doe [jane_doe@her_company.com]
sent: 1/5/2017 11:34 am
to: support@acme.com
subject: re: [graymail] happy new year from amce!

our acme app is not working

Desired result (in any format, I've stored earlier, simpler regex in lists by leveraging re.findall()):

thank you john

refit will not come up, even after logging on. i have my pass word and user
        name write on a sheet of paper in my wallet, so i know it is correct. it
    looks like it is trying to come up, but all i see is  two arrows going
    in circles..



    Jane

our acme app is not working

EDIT:

I was able to parse chat logs earlier using code like this. The source data is currently stored in a pandas dataframe consisting of just a client_id-log pair. My current problem is structured identically, in client_id - email_chain pairs:

for index, row in df_chttext.iterrows(): #for each client-chat item:
    list_cleaned = [] #clear out old list_cleaned

    chat = row['chat_log'] #grab chat log
    list_visitor = re.findall('Visitor: .*?\n', chat) #get list of only visitor messages

    if list_visitor: #if there is a list of client messages
        for message in list_visitor: #scrub the message
            scrub = message.replace('Visitor: ','')
            scrub = scrub.replace('\n','')
            scrub = scrub.replace('&#39;','\'')
            scrub = scrub.replace('&gt;','>')
            scrub = scrub.replace('&lt;','<')
            list_cleaned.append(scrub)
        df_chttext.at[index,'chat_log'] = list_cleaned #replace previous chat with scrubbed chat
    else:
        df_chttext.at[index,'chat_log'] = '' #if no user messages, then leave it empty

Andreas Storvik Strauman · Accepted Answer

I suggest a line-by-line "capture state" approach*. You read the file line by line and decide, for each line, whether or not to include it in the final output.

Consider the following:

Read a line.
Test whether the line matches either an "Email start" or an "Email end" pattern.
If it matches a start pattern, turn the capture state on; if it matches an end pattern, turn the capture state off.
Test whether it matches lines that you do not want in the body but that shouldn't affect the capture state. If it does, skip this line.
If the capture state is on and the line should not be skipped, add it to the overall body. Otherwise, do not add it, and go to the next line.

Below is some python (almost pseudo) code to achieve this.

This script is not complete (I don't feel like writing all the code for you), but it could give you a base to start working from. (Maybe start by moving these for loops into a function, e.g. def matches_pattern_in_list(text, patterns).)

import re

fname = "data.txt"
# Are we capturing data? A chain may open with a client message, so start "on".
isCapturing = True
# Patterns that turn the capturing state "on"
startPatterns = [
    # e.g. "someone@her_company.com wrote:" (any domain except acme)
    re.compile(r'\S+@(?!acme\.)\S+ wrote:', re.IGNORECASE),
    # .... more patterns here ....
]
# Patterns that end the capturing state
endPatterns = [
    # e.g. "<someone@acme.com> wrote:"
    re.compile(r'\S+@acme\.\S+> wrote:', re.IGNORECASE),
]
# Patterns that don't affect the capturing state,
# but should still be ignored (header lines)
ignorePatterns = [
    re.compile(r'\s*(from|sent|subject):', re.IGNORECASE),
]
messageBodies = ""
with open(fname) as f:
    for line in f:  # iterating the file reads one line at a time
        skipThisLine = False
        for patt in startPatterns:
            if patt.search(line):
                isCapturing = True
                skipThisLine = True  # don't keep the marker line itself
                break
        for patt in endPatterns:
            if patt.search(line):
                isCapturing = False
                break
        for patt in ignorePatterns:
            if patt.match(line):
                skipThisLine = True
                break
        if isCapturing and not skipThisLine:
            messageBodies += line
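
To get one list entry per client message (closer to the desired result in the question), the same idea can be wrapped in a function that works on a chain held in a string, such as a dataframe cell. This is only a rough sketch under assumptions: the pattern lists are my own reading of the start/end rules in the question, extract_client_messages is a made-up name, and I treat "to:" as a header line to drop. It has only been reasoned through against the sample chain, so expect to tune the patterns on real data.

```python
import re

# Assumed translations of the question's rules; adjust for your data.
START_PATTERNS = [
    # *@[not acme] wrote:
    re.compile(r"\S+@(?!acme\.)\S+ wrote:", re.IGNORECASE),
    # [newline]to: *@acme.com  /  [newline]to: *'acme support'
    re.compile(r"^\s*to:.*(@acme\.com|acme support)", re.IGNORECASE),
]
END_PATTERNS = [
    # *@[acme]> wrote:
    re.compile(r"\S+@acme\.\S+> wrote:", re.IGNORECASE),
    # [newline]From: *@acme.com
    re.compile(r"^\s*from:.*@acme\.com", re.IGNORECASE),
]
# Header lines that should never end up in a body.
IGNORE_PATTERNS = [
    re.compile(r"^\s*(from|sent|to|subject):", re.IGNORECASE),
]


def matches_pattern_in_list(text, patterns):
    """Return True if any compiled pattern is found somewhere in text."""
    return any(p.search(text) for p in patterns)


def extract_client_messages(chain):
    """Split one email chain (a string) into a list of client message bodies."""
    messages = []
    current = []      # lines of the message being collected
    capturing = True  # the beginning of the chain counts as a start

    def flush():
        # Store the collected lines as one message, skipping empty ones.
        body = "\n".join(current).strip()
        if body:
            messages.append(body)
        current.clear()

    for line in chain.splitlines():
        if matches_pattern_in_list(line, END_PATTERNS):
            flush()
            capturing = False
        elif matches_pattern_in_list(line, START_PATTERNS):
            flush()
            capturing = True
        elif capturing and not matches_pattern_in_list(line, IGNORE_PATTERNS):
            current.append(line)
    flush()  # the end of the chain counts as an end
    return messages


sample = (
    "thank you john\n"
    "from: noreply@acme.com [mailto:noreply@acme.com] on behalf\n"
    "hi Jane,\n"
    "Jane_Doe@her_company.com wrote:\n"
    "refit will not come up\n"
)
for message in extract_client_messages(sample):
    print(repr(message))
```

End patterns are checked before start patterns, so a line can never both close and reopen a message. With a dataframe like the one in the question, something along the lines of df['email_chain'].apply(extract_client_messages) should then give a list per client, where the column name is a guess on my part.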

*: Yes. I did make that up.
