barciewicz

Reputation: 3813

Open Outlook .msg like a text file in Python?

I want to treat an Outlook .msg file as a string and check whether a substring exists in it.

Importing the win32 library, which similar SO threads suggest, seemed like overkill for that.

Instead, I tried to just open the file the same way as a .txt file:

file_path = 'O:\\MAP\\177926 Delete comiitted position.msg'

mail = open(file_path)
mail_contents = mail.read()
print(mail_contents)

However, I get

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 870: character maps to <undefined>

Is there an encoding I can specify to make it work?

I have also tried

mail = open(file_path, encoding='utf-8')

which returns

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

Upvotes: 3

Views: 6908

Answers (2)

Danny_ds

Reputation: 11406

Unless you're willing to do a lot of work, you really should use a library for this.

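First, a .msg file is a binary file, so its contents should not be read as text. Opening it in text mode makes Python decode every byte with a text codec, which either fails (as in your errors) or yields characters that do not match the stored data. In languages where a string is terminated by a null byte, the many null bytes inside a binary file would also truncate what you see; Python strings are not null-terminated, but the decoding problem remains.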
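To make the "binary file" point concrete, here is a minimal sketch that opens the file in binary mode and checks its first eight bytes. It assumes the .msg is stored as an OLE2 compound file, whose signature begins with 0xD0, the same byte the 'utf-8' error above reports at position 0. The name `looks_like_ole2` is made up for this illustration.

```python
# Signature that opens an OLE2 compound file, the container format used by .msg.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(path):
    """Return True if the file starts with the OLE2 compound-file signature."""
    # "rb" returns raw bytes; no text codec is involved, so nothing can fail to decode.
    with open(path, "rb") as f:
        return f.read(8) == OLE2_SIGNATURE
```

Reading with `"rb"` never raises `UnicodeDecodeError`, because the bytes are returned exactly as stored.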

Also, a .msg file can store text as plain ASCII and/or Unicode (typically UTF-16) in different parts/blocks of the file, so it would be really hard to treat the whole file as one string to search for a substring.
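If a rough check is enough, one possible approach is to read the raw bytes and look for the substring in each encoding it might be stored in. This is only a sketch under assumptions: `msg_contains` is a made-up helper, the choice of UTF-16LE and cp1252 as candidate encodings is a guess about typical files, and it will miss text that is compressed (for example an RTF body), split across storage blocks, or encoded some other way.

```python
from pathlib import Path


def msg_contains(path, needle):
    """Best-effort search for `needle` in the raw bytes of a binary file."""
    data = Path(path).read_bytes()

    # Candidate byte patterns: the same text in the encodings we assume may occur.
    patterns = {needle.encode("utf-16-le")}
    try:
        patterns.add(needle.encode("cp1252"))
    except UnicodeEncodeError:
        pass  # needle has characters cp1252 cannot represent

    return any(pattern in data for pattern in patterns)
```

A `False` result here does not prove the text is absent from the message; a library that parses the format is still the reliable route.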

As an alternative you could save the mails as .eml (i.e. the plain text version of an e-mail), but there would still be some problems to overcome in order to search for a specific text:

  • Traditionally, all data in an e-mail is 7-bit ASCII (1-127), which means special characters have to be encoded into ASCII bytes. There are several different encodings for headers (for example 'Subject'), body, and attachments.
  • Body text: can be plain text or HTML (or both). Lines and words can be split because there is a maximum line length. Different encodings can be used, even base64, in which case you would never find the text you're looking for in the raw file.
  • A lot more would have to be done to properly decode everything, but this should give you an idea of the work you would have to do in order to find the text you're looking for.
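For the .eml route, Python's standard `email` package already handles much of the decoding listed above. The following is a minimal sketch, not a complete solution: `eml_contains` is a made-up helper, it only looks at the Subject header and at text parts, and it assumes every part declares a charset Python knows.

```python
from email import policy
from email.parser import BytesParser


def eml_contains(raw_bytes, needle):
    """Search the decoded Subject and text parts of an .eml message."""
    # policy.default decodes encoded headers and gives access to get_content().
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)

    subject = msg["Subject"] or ""
    if needle in subject:
        return True

    for part in msg.walk():
        # Multipart containers are skipped; text/plain and text/html are searched.
        if part.get_content_maintype() == "text":
            # get_content() undoes base64 / quoted-printable and applies the charset.
            if needle in part.get_content():
                return True
    return False
```

Typical use would be `eml_contains(open(path, "rb").read(), "some text")`. Text in an HTML part is searched together with its markup, so a phrase interrupted by tags will still be missed.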

Upvotes: 2

scharette

Reputation: 9997

When you face this type of issue, it is good practice to try Python's Latin-1 encoding.

mail = open(file_path, encoding='Latin-1')

People often confuse the Windows cp1252 encoding with Python's Latin-1. The two differ: Latin-1 maps all 256 possible byte values to the first 256 Unicode code points, so decoding never fails, whereas cp1252 leaves a few byte values (such as 0x90) undefined.
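A small self-contained check of that difference, plus one caveat: decoding with Latin-1 always succeeds, but it does not make text stored as UTF-16 searchable, because every letter ends up followed by a NUL character. The variable names are illustrative only.

```python
all_bytes = bytes(range(256))

# Latin-1: byte value N always becomes code point N, so any input decodes
# and encodes back to the identical bytes.
latin1_text = all_bytes.decode("latin-1")
roundtrip_ok = latin1_text.encode("latin-1") == all_bytes

# cp1252: byte 0x90 has no mapping, which is the 'charmap' error in the question.
try:
    b"\x90".decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Caveat: UTF-16 text read as Latin-1 has a NUL after every letter,
# so a plain substring test does not match it.
utf16_as_latin1 = "Delete".encode("utf-16-le").decode("latin-1")
found = "Delete" in utf16_as_latin1
```

So Latin-1 removes the exception, but a substring test on the result can still miss text that is present in the message.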

See this for more information.

Upvotes: 1
