Reputation: 41
I'm trying to read in files from the released Enron Dataset for a data science project. My problem lies in how I'm reading the files. The first 15 or so lines of every email are information about the email itself: To, From, Subject, etc. So you would think you could just read the first 15 lines and assign them to an array. The problem is that my algorithm relies on whitespace, and sometimes the "To" field alone can span something like 50 lines.
Example of a (slightly truncated) troublesome email:
Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>
Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], collee [email protected],
[email protected], [email protected]
Subject: Final Filed Version -- SDG&E Comments
My code:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    count = 0
    for line in email:
        mem = []
        if line == '':
            pass
        else:
            if line[0].isspace():
                print(line, count)
                email[count-1] += line
                del email[count]
        count += 1
    return [email[:20]]
Right now it can handle emails with one extra line in the Subject/To/From fields, but not more than that. Any ideas?
Upvotes: 2
Views: 93
Reputation: 1253
No need to reinvent the wheel. The email.parser module can be your friend. To parse just the header, you can use the built-in parser and write a function like the one below (I also include a more portable way of constructing the file name):
import email.parser
import os.path

def read_email_header(username, email_number, corpus_root='~/tmp/data/enron'):
    corpus_root = os.path.expanduser(corpus_root)
    fname = os.path.join(corpus_root, username, 'all_documents', email_number)
    with open(fname, 'rb') as fd:
        header = email.parser.BytesHeaderParser().parse(fd)
    return header

mm = read_email_header('dasovich-j', '13078.')

print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])
Running this gives:
['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Fri, 25 May 2001 02:50:00 -0700 (PDT)
[email protected]
['[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected]']
Reuters -- FERC told Calif natgas to reach limit this summer
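If you want the individual addresses rather than the whitespace-split tokens (note the trailing commas in the output above), email.utils.getaddresses can split the comma-separated To header for you. A small sketch building on the mm object from above:

import email.utils

# Split the comma-separated To header into (name, address) pairs
# and keep just the addresses, without the trailing commas.
recipients = [addr for _, addr in email.utils.getaddresses([mm['To']])]
print(recipients)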
Upvotes: 1
Reputation: 649
Maybe use regular expressions, adapted to your needs. For example, you can collect the "sent to" email addresses as follows:
import re

sent_to = []  # collects the matched addresses

def collect_sent_to(text):  # pass the header text you want to scan
    email_match = re.search(r'(.+@.+\..+)', text)  # regex pattern to match an email address
    if email_match:
        sent_to.append(list(email_match.groups()))  # add the match to the sent_to list
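A minimal usage sketch, assuming you only want the To field; the file path is just a placeholder, and the to-field tracking around the call is my addition rather than part of the snippet above:

# Scan only the To header and its continuation lines for addresses.
with open('maildir/dasovich-j/all_documents/13078.') as fh:  # placeholder path
    in_to = False
    for line in fh:
        if not line.strip():          # blank line marks the end of the header block
            break
        if line.lower().startswith('to:'):
            in_to = True
        elif not line[0].isspace():   # a new header field starts
            in_to = False
        if in_to:
            collect_sent_to(line)

print(sent_to)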
Upvotes: 0
Reputation: 40013
The easy way to approach problems like this (setting aside the good idea of using an existing parser) is to treat the transformation as being performed on one list of lines to yield another list, rather than trying to mutate an existing list while looping over it. Something like:
new = []
for l in old:
    if is_continuation(l): new[-1] += l
    else: new.append(l)
For all but the longest lists (where del old[i] is expensive anyway) this is quite efficient if most lines are not continuations, since those lines can be reused in new as-is.
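Applied to your header problem, a minimal sketch of that idea, assuming (as in the sample email above) that continuation lines are exactly the ones that start with whitespace:

def is_continuation(line):
    # Folded header lines start with whitespace.
    return bool(line) and line[0].isspace()

def unfold_header(lines):
    unfolded = []
    for line in lines:
        if unfolded and is_continuation(line):
            unfolded[-1] += ' ' + line.strip()   # merge into the previous header field
        else:
            unfolded.append(line.rstrip('\n'))
    return unfolded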
Upvotes: 0