Reputation: 77
I have employee-client email exchanges from which I need to pull the client message bodies, to be consumed by future sentiment analysis.
These emails were generated using different email applications, so there's not a single regex rule I can use to separate the emails, and they don't all conform to the form used by the email module, so object-style parsing isn't a possibility. Sometimes different email applications were mixed in a single chain, so I can't regex on a specific profile either.
However, these rules have cropped up as reliable:
Email Start:
Email End:
These can be mixed and matched during the life of a chain. A message can start with a 'wrote' rule, and end with a '[newline]From:' rule, or '*@acme> wrote' rule, etc.
Is there any elegant way to set different start and end conditions for this regex? Ideally, it would lazily stop at the first instance of hitting one of the end rules.
FWIW, I consider myself strictly intermediate with Python: experienced enough to struggle meaningfully through documentation, but not enough to play with the deeper layers of the language.
Example Source Data:
thank you john
from: [email protected] [mailto:[email protected]] on behalf
of acme help
sent: thursday, december 29, 2016 11:28 am
to: Jane Doe
subject: re: aha - overtime
hi Jane,
it is affected by your payroll schedule. because it is semi-monthly,
overtime is a tricky thing to calculate, so we have to make sure we do
it just right! once i turn this setting on, you will be good to go from
this point on!
best regards,
john doe
customer experience team
[]
<http://portal.mxlogic.com/images/transparent.gif>
<http://portal.mxlogic.com/images/transparent.gif>
ref:_00d15ft7b._50015ypl8b:
Jane_Doe@her_company.com wrote:
refit will not come up, even after logging on. i have my pass word and user
name write on a sheet of paper in my wallet, so i know it is correct. it
looks like it is trying to come up, but all i see is two arrows going
in circles..
Jane
from: acme support [mailto:[email protected]]
sent: thursday, january 05, 2017 10:42 am
to: Jane Doe
subject: [graymail] re: happy new year from acme!
hello Jane,
sorry to hear that you're having trouble using acme. can you please
elaborate on the issue that you're experiencing?
best regards,
john
customer experience team
--------------- original message ---------------
from: Jane Doe [jane_doe@her_company.com]
sent: 1/5/2017 11:34 am
to: [email protected]
subject: re: [graymail] happy new year from amce!
our acme app is not working
Desired result (in any format; I've stored the results of earlier, simpler regexes in lists by leveraging re.findall()):
thank you john
refit will not come up, even after logging on. i have my pass word and user
name write on a sheet of paper in my wallet, so i know it is correct. it
looks like it is trying to come up, but all i see is two arrows going
in circles..
Jane
our acme app is not working
EDIT:
I was able to parse chat logs earlier using code like this. The source data is currently stored in a pandas dataframe consisting of just a client_id-log pair. My current problem is structured identically, in client_id - email_chain pairs:
import re

for index, row in df_chttext.iterrows():  # for each client-chat item:
    list_cleaned = []  # clear out old list_cleaned
    chat = row['chat_log']  # grab chat log
    list_visitor = re.findall('Visitor: .*?<br>', chat)  # get list of only visitor messages
    if list_visitor:  # if there is a list of client messages
        for message in list_visitor:  # scrub the message
            scrub = message.replace('Visitor: ', '')
            scrub = scrub.replace('<br>', '')
            scrub = scrub.replace('&#39;', '\'')
            scrub = scrub.replace('&gt;', '>')
            scrub = scrub.replace('&lt;', '<')
            list_cleaned.append(scrub)
        df_chttext.at[index, 'chat_log'] = list_cleaned  # replace previous chat with scrubbed chat
    else:
        df_chttext.at[index, 'chat_log'] = ''  # if no user messages, then leave it empty
Upvotes: 0
Views: 91
Reputation: 1655
I suggest a line-by-line "capture state" approach*. You read the file line by line and decide whether or not to include it in the final output.
Consider the following:
If a line matches a start pattern, it should turn the capture state on
If it matches an end pattern, it should turn the capture state off
Below is some Python (almost pseudo) code to achieve this.
This script is not complete (I don't feel like writing all the code for you), but it could give you a base to start working from. (Maybe start by moving these for loops into a function, e.g. def matches_pattern_in_list(text, patterns).)
import re

fname = "data.txt"

# Are we capturing data? Start "on" so the first message in the chain is kept.
isCapturing = True

# Patterns that turn the capturing state "on"
startPatterns = [
    # a "<non-acme address> wrote:" line (illustrative, adjust to your data)
    re.compile(r'\S+@(?!acme)\S+ wrote:'),
    # .... more patterns here ....
]

# Patterns that will end the capturing state
endPatterns = [
    # a "...@acme...> wrote:" line
    re.compile(r'.*@acme.*> wrote:'),
]

# Patterns that don't affect the capturing state,
# but whose lines should still be ignored
ignorePatterns = [
    re.compile(r'from|sent|subject'),
]

messageBodies = ""
with open(fname) as f:
    for line in f:  # read the file one line at a time
        skipThisLine = False
        for patt in startPatterns:
            if patt.match(line):
                isCapturing = True
                skipThisLine = True  # the marker line itself is not body text
                break
        for patt in endPatterns:
            if patt.match(line):
                isCapturing = False
                break
        for patt in ignorePatterns:
            if patt.match(line):
                skipThisLine = True
                break
        if isCapturing and not skipThisLine:
            messageBodies += line
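The helper mentioned above could look roughly like the sketch below, which also wraps the capture-state loop in a function that takes one email chain as a string (handy if the chains live in a DataFrame column rather than a file). The names START_PATTERNS, END_PATTERNS and extract_client_lines are made up for illustration, and the patterns are only guesses based on the sample data; they will not cover every rule in your chains.

```python
import re

# Hypothetical rule lists: replace them with the rules that hold for your data.
START_PATTERNS = [re.compile(r'\S+@(?!acme)\S+ wrote:')]
END_PATTERNS = [
    re.compile(r'\S+@acme\S*> wrote:'),
    re.compile(r'from:.*acme', re.IGNORECASE),
]


def matches_pattern_in_list(text, patterns):
    """Return True if any compiled pattern matches at the start of text."""
    return any(patt.match(text) for patt in patterns)


def extract_client_lines(chain):
    """Return the lines of an email chain seen while the capture state is on."""
    is_capturing = True  # assumption: the chain opens with a client message
    kept = []
    for line in chain.splitlines():
        if matches_pattern_in_list(line, END_PATTERNS):
            is_capturing = False   # an employee message starts here
        elif matches_pattern_in_list(line, START_PATTERNS):
            is_capturing = True    # a client message starts here
        elif is_capturing:
            kept.append(line)      # ordinary line inside a client message
    return kept
```

Because end rules are checked first and each line toggles the state at most once, capturing stops at the first end rule it meets, whichever one that is. With the chains in a pandas column, something like df['email_chain'].apply(extract_client_lines) should then give one list of client lines per row (the column name is an assumption).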
*: Yes. I did make that up.
Upvotes: 1