danielv

Reputation: 3107

MIME message structure parsing and analysis

I am looking for an existing library or code samples that extract the relevant parts from a MIME message structure, so that I can analyze the textual content of those parts.

I will explain:

I am writing a library (in Python) as part of a project that needs to iterate over a very large number of email messages through IMAP. For each message, it needs to determine which MIME parts it will need in order to analyze the textual content of the message, choosing the parts that require the least parsing (e.g. prefer text/plain over text/html or rich text) and avoiding duplicates (i.e. if text/plain exists, ignore the matching text/html). It also needs to handle nested parts (text attachments, forwarded messages, etc.), all without downloading the entire message body, which takes too much time and bandwidth. The end goal is to later retrieve only those parts and perform statistical and pattern analysis on the text content of the messages (excluding any markup, metadata, binary data, etc.).

The libraries and examples I've seen require the full message body in order to assemble the message structure and understand the content of the message. I am trying to achieve this using only the response from the IMAP FETCH command with the BODYSTRUCTURE data item.
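
To make the idea concrete, here is a minimal sketch of the two FETCH calls involved, using the standard `imaplib` module. The helper names `fetch_bodystructure` and `fetch_section` are made up for illustration, and the sketch assumes the server answers the BODYSTRUCTURE request without IMAP literals and returns the section content as a literal, which is the common case but not guaranteed.

```python
import imaplib  # only needed for the connection shown in the comment below


def fetch_bodystructure(conn, msg_num):
    """Ask the server for the structure of one message, not its content."""
    status, data = conn.fetch(str(msg_num), "(BODYSTRUCTURE)")
    if status != "OK":
        raise RuntimeError(f"FETCH BODYSTRUCTURE failed: {status}")
    first = data[0]
    # Assumption: no IMAP literals in the response, so it arrives as a
    # single bytes object rather than a (header, literal) tuple.
    if not isinstance(first, bytes):
        raise ValueError("response uses literals; not handled in this sketch")
    return first.decode("ascii", errors="replace")


def fetch_section(conn, msg_num, section):
    """Download one body section (e.g. "1" or "2.1") without marking it read."""
    status, data = conn.fetch(str(msg_num), f"(BODY.PEEK[{section}])")
    if status != "OK":
        raise RuntimeError(f"FETCH BODY.PEEK[{section}] failed: {status}")
    # Assumption: the content comes back as a literal, which imaplib
    # exposes as a (header, payload) tuple in the first list entry.
    return data[0][1]


# Hypothetical usage (host and credentials are placeholders):
#   conn = imaplib.IMAP4_SSL("imap.example.com")
#   conn.login("user", "password")
#   conn.select("INBOX", readonly=True)
#   print(fetch_bodystructure(conn, 1))
```

The payload returned by `fetch_section` is still transfer-encoded (base64, quoted-printable, and so on) as declared in the BODYSTRUCTURE entry for that part, so it has to be decoded before analysis.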

BODYSTRUCTURE should contain enough information to achieve my goal. However, although the structure and returned data are officially documented in the relevant RFCs (3501, 2822, 2045), the amount of nesting, the combinations and the various quirks all add up to make the task very tedious and error-prone.
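
For illustration, the nesting in question is a parenthesized list of quoted strings, numbers, NIL and sub-lists. A rough sketch of turning such a response into nested Python lists is shown below. The function name `parse_sexp` is made up, and the sketch deliberately ignores IMAP literals (`{n}` followed by raw bytes), which a robust parser would have to handle.

```python
import re

# One token is "(", ")", a quoted string (with backslash escapes), or an atom.
_TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_sexp(text):
    """Parse an IMAP parenthesized list into nested Python lists.

    Quoted strings become str, numbers become int, NIL becomes None.
    """
    stack = [[]]
    for tok in _TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced parentheses")
            done = stack.pop()
            stack[-1].append(done)
        elif tok.startswith('"'):
            # Strip the quotes and undo backslash escaping.
            stack[-1].append(re.sub(r"\\(.)", r"\1", tok[1:-1]))
        elif tok.upper() == "NIL":
            stack[-1].append(None)
        elif tok.isdigit():
            stack[-1].append(int(tok))
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced parentheses")
    return stack[0]


if __name__ == "__main__":
    line = ('1 (BODYSTRUCTURE (("TEXT" "PLAIN" NIL NIL NIL "7BIT" 5 1)'
            '("TEXT" "HTML" NIL NIL NIL "7BIT" 9 1) "ALTERNATIVE"))')
    print(parse_sexp(line))
```

In the parsed result, a multipart body is a list whose first element is itself a list, while a single part starts with its type string. That distinction is what the RFC 3501 grammar relies on.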

Does anyone know of any libraries that can help achieve this, or any code samples (preferably in Python, but any language will do)?

Upvotes: 0

Views: 1202

Answers (2)

danielv

Reputation: 3107

Answering my own question for the sake of completeness and to close this question.

I couldn't find any existing library that meets the requirements. I ended up writing my own code to fetch the BODYSTRUCTURE tree, parse it and store it in an internal structure. This gives me the control I need to decide exactly which parts of the message to download, taking into account various cases such as attachments, forwards, redundant parts (plain text vs. HTML), etc.
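
This is not my actual code, but the selection step can be sketched roughly as follows. It assumes the BODYSTRUCTURE has already been parsed into nested Python lists (strings for quoted values, `None` for NIL), and the function name `select_text_parts` is made up for this example. It returns the IMAP section numbers of the text parts worth downloading, preferring text/plain inside multipart/alternative and descending into forwarded message/rfc822 parts.

```python
def select_text_parts(body, section=""):
    """Return [(section, mime_type)] for the text parts worth downloading."""
    if isinstance(body[0], list):  # multipart/*: leading lists are the children
        count = 0
        while count < len(body) and isinstance(body[count], list):
            count += 1
        subtype = str(body[count]).lower() if count < len(body) else "mixed"
        results = [
            select_text_parts(part, f"{section}.{i}" if section else str(i))
            for i, part in enumerate(body[:count], start=1)
        ]
        if subtype == "alternative":
            # Alternatives carry the same content: keep one, plain text first.
            for found in results:
                if any(mime == "text/plain" for _, mime in found):
                    return found
            return next((found for found in results if found), [])
        return [item for found in results for item in found]

    own = section or "1"  # a non-multipart top-level body is section 1
    mime = f"{body[0]}/{body[1]}".lower()
    if mime == "message/rfc822":
        # Fields: type, subtype, params, id, description, encoding, size,
        # envelope, embedded body structure, line count.
        inner = body[8]
        inner_section = own if isinstance(inner[0], list) else f"{own}.1"
        return select_text_parts(inner, inner_section)
    if mime.startswith("text/"):
        return [(own, mime)]
    return []  # binary attachments, images, etc. are skipped


if __name__ == "__main__":
    plain = ["TEXT", "PLAIN", ["CHARSET", "UTF-8"], None, None, "7BIT", 12, 1]
    html = ["TEXT", "HTML", ["CHARSET", "UTF-8"], None, None, "7BIT", 40, 2]
    pdf = ["APPLICATION", "PDF", None, None, None, "BASE64", 1000]
    message = [[plain, html, "ALTERNATIVE"], pdf, "MIXED"]
    print(select_text_parts(message))
```

Each returned section number can then be requested individually with `BODY.PEEK[<section>]`, so only the selected text parts are ever transferred.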

Upvotes: 0

Johan Lundberg

Reputation: 27038

Is there something that you cannot do with the email module and its email.mime submodule?

http://docs.python.org/library/email.html#module-email
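
For reference, a minimal sketch of what the email package offers in current Python 3 versions. Note that it works on the raw bytes of a complete message, so it assumes the whole message has already been downloaded. The function name `preferred_text` is made up for this example.

```python
import email
from email import policy


def preferred_text(raw_bytes):
    """Return the text/plain body if present, else text/html, else None."""
    # policy.default yields an EmailMessage, which provides get_body().
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else None
```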

Upvotes: 1
