ManojK

Reputation: 1640

How to import .msg files in Python along with attachments from a local directory

I am working on an Outlook email automation task where Outlook .msg email files are stored in a directory. My task is to extract information (email body, attachment text, etc.) from the .msg files and run NLP to categorize them. So far I have used extract_msg from https://pypi.org/project/extract-msg/ and https://github.com/mattgwwalker/msg-extractor. I am able to extract the mail body text, but the next challenges I am facing are:

  1. How to extract text from attachments such as PDF and text files?
  2. How to read a multi-part email (an email message with a trail of replies)?

I read answers in multiple threads before writing my own question, but most of them deal with extracting emails directly from Outlook.exe. I do not need to extract information from Outlook itself; the Outlook messages are stored in a local directory as .msg files.

My progress so far is:

import extract_msg
import pandas as pd
import os

direct = os.getcwd() # directory object to be passed to the function for accessing emails

ext = '.msg' # type of files in the folder to be read

def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        if i.endswith(extension):
            msg = extract_msg.Message(os.path.join(directory, i)) # full path, so this also works outside the current directory
            my_list.append([msg.filename,msg.sender,msg.to, msg.date, msg.subject, msg.body])
    # build the DataFrame once, after the loop, and return it instead of using a global
    df = pd.DataFrame(my_list, columns = ['File Name','From','To','Date','Subject','MailBody Text'])
    print(df.shape[0],' rows imported')
    return df

df = DataImporter(direct,ext)

And the requirement is like this:

Mail Body = 'This is a sample email body text'.

Attachment = 'Invoice123'

Attachment text = 'Your invoice is ready for processing'

Something like this. Any help will be appreciated; please let me know if further information is required.

Edit: Please comment if you know any other package which can be used to achieve this task.

Upvotes: 2

Views: 11080

Answers (3)

Taki

Reputation: 11

There are solutions that meet your requirements. In my work, I tested the MSG PY module from Independentsoft. This is a Microsoft Outlook .msg file module for Python. The module allows you to easily create, read, parse, and convert Outlook .msg files. For example:

from independentsoft.msg import Message
from independentsoft.msg import Attachment

message = Message(file_path = "e:\\message.msg")

for i in range(len(message.attachments)):
    attachment = message.attachments[i]
    attachment.save("e:\\" + str(attachment.file_name))

Upvotes: 1

ManojK

Reputation: 1640

Posting the solution that worked for me (as asked by Amey P Naik). As mentioned, I tried multiple modules, but only extract_msg worked for the case at hand. I created two functions for importing the Outlook message text and attachments into a Pandas DataFrame. The first function creates one folder for each email message, and the second imports the data from each message into a DataFrame. Attachments need to be processed separately with a for loop over the sub-directories in the parent directory. Below are the two functions I created, with comments:

# 1). Import the required modules and setup working directory

import extract_msg
import os
import pandas as pd
direct = os.getcwd() # directory object to be passed to the function for accessing emails, this is where you will store all .msg files
ext = '.msg' #type of files in the folder to be read

# 2). Create separate folder by email name and extract data 

def content_extraction(directory,extension):
    for mail in os.listdir(directory):
        try:
            if mail.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, mail)) #This will create a local 'msg' object for each email in the directory
                msg.save() #This will create a separate folder for each email inside the parent folder and save a text file with the email body content; it will also save all attachments inside this folder.
        except(UnicodeEncodeError,AttributeError,TypeError) as e:
            pass # Some emails cannot be processed because of different formats, e.g. emails sent from a mobile device.

content_extraction(direct,ext)

# 3). Import the data to a Pandas DataFrame using the extract_msg module
# Note: this will not import data from the sub-folders inside the parent directory;
# it extracts the information from the .msg files themselves. You can use a loop instead
# to import data directly from the files saved in the sub-folders.

def DataImporter(directory, extension):
    my_list = []
    for i in os.listdir(directory):
        try:
            if i.endswith(extension):
                msg = extract_msg.Message(os.path.join(directory, i))
                my_list.append([msg.filename,msg.sender,msg.to, msg.date, msg.subject, msg.body, msg.message_id]) #These are built-in attributes of the extract_msg.Message class
        except(UnicodeEncodeError,AttributeError,TypeError) as e:
            pass
    # build the DataFrame once, after all files are read, and return it instead of using a global
    df = pd.DataFrame(my_list, columns = ['File Name','From','To','Date','Subject','MailBody Text','Message ID'])
    print(df.shape[0],' rows imported')
    return df

df = DataImporter(direct,ext)

After running these two functions, you will have almost all of the information inside a Pandas DataFrame, which you can use as needed. If you also need to extract content from attachments, you need to loop over all sub-directories inside the parent directory and read the attachment files according to their format. In my case the formats were .pdf, .jpg, .png, .csv, etc. Each format requires a different technique; for example, getting text from images or scanned PDFs requires an OCR module such as Pytesseract.
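The loop over sub-directories described above could look roughly like the sketch below. It is only an illustration: the function name read_attachment_texts and the dictionary keys are made up for this example, only .txt and .csv files are actually read (using the standard library), and every other format is left as None so that a PDF parser or OCR step of your choice can be plugged in. Note that the text file with the email body that msg.save() writes into each folder will be picked up as well.

```python
import csv
import os


def read_attachment_texts(parent_dir):
    """Walk every sub-folder of parent_dir and collect one row per saved file.

    Hypothetical helper for illustration; not part of extract_msg.
    """
    rows = []
    for folder, _subdirs, files in os.walk(parent_dir):
        if folder == parent_dir:
            continue  # skip the .msg files stored at the top level
        for name in sorted(files):
            path = os.path.join(folder, name)
            suffix = os.path.splitext(name)[1].lower()
            if suffix == ".txt":
                with open(path, encoding="utf-8", errors="replace") as fh:
                    text = fh.read()
            elif suffix == ".csv":
                with open(path, newline="", encoding="utf-8", errors="replace") as fh:
                    # join cells with spaces and rows with newlines
                    text = "\n".join(" ".join(row) for row in csv.reader(fh))
            else:
                text = None  # .pdf, .jpg, .png etc. need a PDF parser or OCR
            rows.append({
                "Folder": os.path.basename(folder),
                "Attachment": name,
                "Attachment text": text,
            })
    return rows
```

The returned list of dictionaries can be passed straight to pd.DataFrame(...) and joined with the message DataFrame on the folder name, assuming the folder names can be matched back to the emails.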

If you find an easier way to extract content from attachments, please post your solution here for future reference. If you have any questions, please comment. Also, if there is any scope for improvement in the above code, please feel free to point it out.

Upvotes: 0

Dmitry Streblechenko

Reputation: 66255

In the Outlook Object Model, use Application.Session.OpenSharedItem: pass the fully qualified MSG file name and get back a MailItem object.
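From Python, this is usually done through the pywin32 package. The following is a minimal sketch under several assumptions: it runs on Windows, Outlook is installed, pywin32 is available, and the function name read_msg_via_outlook and the save_dir parameter are made up for this example.

```python
import os


def read_msg_via_outlook(msg_path, save_dir):
    """Open a .msg file through Outlook and save its attachments to save_dir."""
    # Imported lazily: pywin32 is Windows-only and needs Outlook installed.
    import win32com.client

    outlook = win32com.client.Dispatch("Outlook.Application")
    # OpenSharedItem expects a fully qualified path and returns a MailItem
    item = outlook.Session.OpenSharedItem(os.path.abspath(msg_path))
    saved = []
    for attachment in item.Attachments:
        target = os.path.join(os.path.abspath(save_dir), attachment.FileName)
        attachment.SaveAsFile(target)
        saved.append(target)
    return {"Subject": item.Subject, "Body": item.Body, "Attachments": saved}
```

Unlike extract_msg, this approach drives a running Outlook instance, so it is not suitable for machines without Outlook.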

Upvotes: 1
