Reputation: 1329
I am using Python to open EML files one at a time, process them, then move them to another folder. An EML file contains an email message, including the headers.
The first 35-40 lines of the EML are header info, followed by the actual email message. Since the number of header lines varies, I can't just convert my EML file to a list and tell it:
print(emllist[37:])
However, the beginning of the last line of the headers is always the same and begins with X-OriginalArrivalTime.
My goal is to parse my EML file, search for the line number X-OriginalArrivalTime is on and then split the EML into 2 strings, one containing the headers info and one containing the message.
I have been rereading the Python re documentation, but I can't seem to come up with a good way to attack this.
Any help is greatly appreciated
thanks
lou
Upvotes: 0
Views: 1241
Reputation: 27575
It would indeed be nice to avoid a regex. But since you want the header and the message in TWO separate strings, I don't think split() (which drops the separator it splits on) or partition() (which returns a tuple of 3 elements) fits the purpose, so a regex is still worthwhile:
import re
# Raw strings keep the backslashes intact for the regex engine.
regx = re.compile(r'(.+?X-OriginalArrivalTime\.[^\n]*[\r\n]+)'
                  r'(.+)\Z',
                  re.DOTALL)
ss = ('blahblah blah\r\n'
      'totoro tootrototo \r\n'
      'erteruuty\r\n'
      'X-OriginalArrivalTime. 12h58 Huntington Point\r\n'
      'body begins here\r\n'
      'sdkjhqsdlfkghqdlfghqdfg\r\n'
      '23135468796786876544\r\n'
      'ldkshfqskdjf end of file\r\n')
header, message = regx.match(ss).groups()
print('header :\n' + repr(header))
print('')
print('message :\n' + repr(message))
Result:
header :
'blahblah blah\r\ntotoro tootrototo \r\nerteruuty\r\nX-OriginalArrivalTime. 12h58 Huntington Point\r\n'
message :
'body begins here\r\nsdkjhqsdlfkghqdlfghqdfg\r\n23135468796786876544\r\nldkshfqskdjf end of file\r\n'
Upvotes: 0
Reputation: 14487
I am not sure if it works with EML files, but Python has a module for working with email messages (the email package in the standard library).
If that does not work, aren't the headers separated from the message by an empty line?
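# fp is an already opened file object for the EML file
lines = fp.readlines()
# First empty line; it should mark the end of the headers.
# Raises ValueError if the file contains no empty line.
header_end = lines.index('\n')
headers = lines[:header_end]
message = lines[header_end + 1:]  # skip the empty separator line itself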
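For the email module mentioned above, here is a minimal sketch of what the split could look like. It assumes a simple, non-multipart message that is already read into a string; the function name split_eml and the sample message are made up for illustration.

```python
from email import message_from_string


def split_eml(raw_text):
    """Return (headers, body) for a non-multipart message given as text."""
    msg = message_from_string(raw_text)
    headers = msg.items()      # list of (name, value) pairs, in file order
    body = msg.get_payload()   # a str when the message is not multipart
    return headers, body


# Hypothetical sample message, only for demonstration.
sample = (
    "From: lou@example.com\n"
    "Subject: Test\n"
    "X-OriginalArrivalTime: 01 Jan 2011 12:00:00.0000 (UTC)\n"
    "\n"
    "Hello,\n"
    "this is the body.\n"
)

headers, body = split_eml(sample)
for name, value in headers:
    print(name + ": " + value)
print(body)
```

The parser finds the end of the headers by itself, so this does not depend on X-OriginalArrivalTime being the last header. For multipart messages get_payload() returns a list of parts instead of a string, so that case would need extra handling.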
Upvotes: 0
Reputation: 3232
The re module is not very good at counting lines. What's more, you probably don't need it to check the contents of the start of a line. The following function takes the filename of the EML file as input and returns a tuple containing two strings: the header and the message.
def process_eml(filename):
    with open(filename) as fp:
        lines = fp.readlines()
    for i, line in enumerate(lines):
        if line.startswith("X-OriginalArrivalTime"):
            break
    else:
        raise Exception("End of header not found")
    # readlines() keeps each line's trailing newline, so join with ''
    header = ''.join(lines[:i+1])  # Message starts at i + 1
    message = ''.join(lines[i+1:])
    return header, message
Upvotes: 1
Reputation: 363517
After

match = re.search(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)$',
                  open('foo.eml').read(),
                  re.DOTALL | re.MULTILINE)

match.group(1) should contain the headers and match.group(2) the email message's body. The re.DOTALL flag causes . to match newlines, and re.MULTILINE lets ^ match at the start of every line.
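To try the pattern without a file on disk, here is a small sketch that applies it to an in-memory string. The helper name split_with_regex and the sample text are assumptions for illustration. Note that a single capture group is read with match.group(n); match.groups() returns all groups as a tuple.

```python
import re

HEADER_END = re.compile(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)$',
                        re.DOTALL | re.MULTILINE)


def split_with_regex(text):
    """Split text after the X-OriginalArrivalTime line into (headers, body)."""
    match = HEADER_END.search(text)
    if match is None:
        raise ValueError("X-OriginalArrivalTime header not found")
    return match.group(1), match.group(2)


# Hypothetical sample message, only for demonstration.
sample = (
    "From: lou@example.com\n"
    "X-OriginalArrivalTime: 01 Jan 2011\n"
    "\n"
    "Hello\n"
    "World\n"
)

headers, body = split_with_regex(sample)
print(repr(headers))
print(repr(body))
```

Because \n+ is part of the first group, the empty separator line ends up in the headers string rather than at the start of the body.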
Upvotes: 0
Reputation: 521
You can probably avoid regex. How about:
msg = data.split('X-OriginalArrivalTime', 1)[1].split('\n', 1)[1]
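The one-liner above only keeps the message. If the headers are wanted as well, the same idea can be sketched with str.partition, which keeps the text on both sides of the separator. The function name split_after_marker is made up here, and the sketch assumes the marker text does not appear earlier in the headers.

```python
def split_after_marker(data, marker="X-OriginalArrivalTime"):
    """Return (header, message), cutting after the line that contains marker."""
    before, found, after = data.partition(marker)
    if not found:
        raise ValueError("marker not found")
    # Cut at the end of the marker's own line.
    marker_line, newline, message = after.partition("\n")
    header = before + found + marker_line + newline
    # Drop the empty separator line(s) between headers and body.
    return header, message.lstrip("\r\n")


# Hypothetical sample message, only for demonstration.
data = (
    "From: lou@example.com\r\n"
    "X-OriginalArrivalTime: 01 Jan 2011\r\n"
    "\r\n"
    "Hello\r\n"
    "World\r\n"
)

header, message = split_after_marker(data)
print(repr(header))
print(repr(message))
```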
Upvotes: 1