Reputation: 41
I'm trying to read in files from the released Enron Dataset for a data science project. My problem lies in how I'm reading the files. The first 15 or so lines of every email are information about the email itself: To, From, Subject, etc. So you would think you could just read the first 15 lines and assign them to an array. The problem is that my algorithm relies on whitespace, and sometimes the "To" field alone can span something like 50 lines.
Example of a (slightly truncated) troublesome email:
Message-ID: <29403111.1075855665483.JavaMail.evans@thyme>
Date: Wed, 13 Dec 2000 08:22:00 -0800 (PST)
From: [email protected]
To: [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], collee [email protected],
[email protected], [email protected]
Subject: Final Filed Version -- SDG&E Comments
My code:
def readEmailHead(username, emailNum):
    text = ""
    file = open(corpus_root + username + '/all_documents/' + emailNum)
    for line in file:
        text += line
    file.close()
    email = text.split('\n')
    count = 0
    for line in email:
        mem = []
        if line == '':
            pass
        else:
            if line[0].isspace():
                print(line, count)
                email[count-1] += line
                del email[count]
        count += 1
    return [email[:20]]
Right now it can handle emails with one extra line in the Subject/To/From fields, but not more than that. Any ideas?
Upvotes: 2
Views: 93
Reputation: 1253
No need to reinvent the wheel. The email.parser module can be your friend. To parse just the header, you can use the built-in parser and write a function like the one below (I also include a more portable way of constructing the file name):
import email.parser
import os.path

def read_email_header(username, email_number, corpus_root='~/tmp/data/enron'):
    corpus_root = os.path.expanduser(corpus_root)
    fname = os.path.join(corpus_root, username, 'all_documents', email_number)
    with open(fname, 'rb') as fd:
        header = email.parser.BytesHeaderParser().parse(fd)
    return header

mm = read_email_header('dasovich-j', '13078.')

print(mm.keys())
print(mm['Date'])
print(mm['From'])
print(mm['To'].split())
print(mm['Subject'])
Running this gives:
['Message-ID', 'Date', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type', 'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc', 'X-Folder', 'X-Origin', 'X-FileName']
Fri, 25 May 2001 02:50:00 -0700 (PDT)
[email protected]
['[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected],', '[email protected]']
Reuters -- FERC told Calif natgas to reach limit this summer
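If you want the individual addresses rather than the whitespace-split tokens (note the trailing commas in the output above), email.utils.getaddresses can split the comma-separated To header for you. A small sketch building on the mm object from above:

import email.utils

# Split the comma-separated To header into (name, address) pairs
# and keep just the addresses, without the trailing commas.
recipients = [addr for _, addr in email.utils.getaddresses([mm['To']])]
print(recipients)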
Upvotes: 1
Reputation: 649
Maybe use regular expressions, adapted to your needs. For example, you can collect the "sent to" email addresses as follows:
import re

sent_to = []  # collects the matched addresses

def collect_sent_to(text):  # pass the header text you want to scan
    email_match = re.search(r'(.+@.+\..+)', text)  # regex pattern to match an email address
    if email_match:
        sent_to.append(list(email_match.groups()))  # add the match to the sent_to list
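A minimal usage sketch, assuming you only want the To field; the file path is just a placeholder, and the to-field tracking around the call is my addition rather than part of the snippet above:

# Scan only the To header and its continuation lines for addresses.
with open('maildir/dasovich-j/all_documents/13078.') as fh:  # placeholder path
    in_to = False
    for line in fh:
        if not line.strip():          # blank line marks the end of the header block
            break
        if line.lower().startswith('to:'):
            in_to = True
        elif not line[0].isspace():   # a new header field starts
            in_to = False
        if in_to:
            collect_sent_to(line)

print(sent_to)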
Upvotes: 0
Reputation: 40013
The easy way to approach problems like this (setting aside the good idea of using an existing parser) is to treat the transformation as being performed on one list of lines to yield another list, rather than trying to mutate an existing list while looping over it. Something like:
new = []
for l in old:
    if is_continuation(l): new[-1] += l
    else: new.append(l)
For all but the longest lists (where del old[i] is expensive anyway) this is quite efficient if most lines are not continuations, since those lines can be reused in new as-is.
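Applied to your header problem, a minimal sketch of that idea, assuming (as in the sample email above) that continuation lines are exactly the ones that start with whitespace:

def is_continuation(line):
    # Folded header lines start with whitespace.
    return bool(line) and line[0].isspace()

def unfold_header(lines):
    unfolded = []
    for line in lines:
        if unfolded and is_continuation(line):
            unfolded[-1] += ' ' + line.strip()   # merge into the previous header field
        else:
            unfolded.append(line.rstrip('\n'))
    return unfolded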
Upvotes: 0