D3l_Gato

Reputation: 1329

Python - search for string, copy until end of doc

I am using Python to open EML files one at a time, process them, and then move them to another folder. Each EML file contains an email message, including the headers.

The first 35-40 lines of the EML file are header info, followed by the actual email message. Since the number of header lines varies, I can't just convert my EML file to a list and tell it:

print emllist[37:]

However, the last line of the headers always begins with X-OriginalArrivalTime.

My goal is to parse my EML file, find the line that X-OriginalArrivalTime is on, and then split the EML into two strings: one containing the header info and one containing the message.

I have been rereading the Python re documentation, but I can't seem to come up with a good way to attack this.

Any help is greatly appreciated

thanks

lou

Upvotes: 0

Views: 1241

Answers (5)

eyquem

Reputation: 27575

It would indeed be nice to avoid a regex. However, since you want to put the header and the message into two different strings, I think that split(), which discards the separator it splits on, and partition(), which returns a tuple of three elements, do not fit the purpose, so a regex is still worthwhile:

import re

# Raw strings pass the backslash escapes through to the regex engine unchanged
regx = re.compile(r'(.+?X-OriginalArrivalTime\.[^\n]*[\r\n]+)'
                  r'(.+)\Z',
                  re.DOTALL)

ss = ('blahblah blah\r\n'
      'totoro tootrototo \r\n'
      'erteruuty\r\n'
      'X-OriginalArrivalTime. 12h58 Huntington Point\r\n'
      'body begins here\r\n'
      'sdkjhqsdlfkghqdlfghqdfg\r\n'
      '23135468796786876544\r\n'
      'ldkshfqskdjf end of file\r\n')


header,message = regx.match(ss).groups()

print 'header :\n',repr(header)
print
print 'message :\n',repr(message)

result

header :
'blahblah blah\r\ntotoro tootrototo \r\nerteruuty\r\nX-OriginalArrivalTime. 12h58 Huntington Point\r\n'

message :
'body begins here\r\nsdkjhqsdlfkghqdlfghqdfg\r\n23135468796786876544\r\nldkshfqskdjf end of file\r\n'

Upvotes: 0

Ski

Reputation: 14487

I am not sure whether it works with EML files, but Python has a module for working with email messages (the email module).
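As a rough sketch of that suggestion, assuming the EML file is an ordinary RFC 822-style message: the standard library's email module can parse the text and return the headers and the body separately, with no need to know which header comes last. The sample message below is made up for illustration.

```python
import email

# Invented sample; a real EML file would be read from disk instead.
raw = (
    "From: lou@example.com\n"
    "Subject: test\n"
    "X-OriginalArrivalTime: 01 Jan 2011 12:58:00.0000 (UTC)\n"
    "\n"
    "body begins here\n"
    "end of file\n"
)

msg = email.message_from_string(raw)

# Headers as a list of (name, value) pairs, in their original order.
headers = msg.items()

# For a non-multipart message, get_payload() returns the body as a string.
body = msg.get_payload()
```

For a multipart message, get_payload() returns a list of sub-messages rather than a string, so this only covers the simple case.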

If that does not work, aren't the headers separated from the message by an empty line?

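# fp is assumed to be an already opened file object for the EML file
lines = fp.readlines()
header_end = lines.index('\n') # first empty line, I think it is the end of header.
headers = lines[:header_end]
message = lines[header_end + 1:] # skip the empty separator line itself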

Upvotes: 0

HardlyKnowEm

Reputation: 3232

The re module is not very good at counting lines. What's more, you probably don't need it to check for the contents of the start of a line. The following function takes the filename of the EML file as input and returns a tuple containing two strings: the header, and the message.

def process_eml(filename):
    with open(filename) as fp:
        lines = fp.readlines()

    for i, line in enumerate(lines):
        if line.startswith("X-OriginalArrivalTime"):
            break
    else:
        raise Exception("End of header not found")

    # readlines() keeps each line's trailing newline, so join with ''
    header = ''.join(lines[:i+1]) # Message starts at i + 1
    message = ''.join(lines[i+1:])

    return header, message

Upvotes: 1

Fred Foo

Reputation: 363517

After

match = re.search(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)$',
                  open('foo.eml').read(),
                  re.DOTALL | re.MULTILINE)

match.group(1) should contain the headers and match.group(2) the email message's body. The re.DOTALL flag causes . to match newlines, and re.MULTILINE lets ^ match at the start of every line.
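To try the pattern without a file, here is a small sketch that runs it on an in-memory string. The sample text is invented, and individual groups are read with match.group(n).

```python
import re

# Invented sample with "\n" line endings and a blank line after the headers.
text = (
    "From: a\n"
    "X-OriginalArrivalTime: t\n"
    "\n"
    "body line 1\n"
    "body line 2\n"
)

match = re.search(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)$',
                  text,
                  re.DOTALL | re.MULTILINE)

headers = match.group(1)  # everything up to and including the marker line
body = match.group(2)     # everything after it
```

Because \n+ is greedy, the blank separator line ends up at the end of the headers group rather than at the start of the body.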

Upvotes: 0

thatwasbrilliant

Reputation: 521

You can probably avoid regex. How about:

msg = data.split('X-OriginalArrivalTime', 1)[1].split('\n', 1)[1]
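If both halves are needed, the same idea can be sketched with str.find instead of split. The function name split_eml and its marker parameter are made up for this example, and it assumes the marker text does not occur earlier in the file than the real header line.

```python
def split_eml(data, marker="X-OriginalArrivalTime"):
    """Split raw EML text into (header, message) after the marker line."""
    start = data.find(marker)
    if start == -1:
        raise ValueError("marker not found: %r" % marker)
    # Find the newline that ends the marker line.
    end = data.find("\n", start)
    if end == -1:
        # The marker line is the last line, so there is no message part.
        return data, ""
    return data[:end + 1], data[end + 1:]


sample = "A: 1\r\nX-OriginalArrivalTime: x\r\nbody\r\nmore\r\n"
header, message = split_eml(sample)
```

Nothing is discarded here, so header + message gives back the original text.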

Upvotes: 1
