Detilium

Reputation: 3034

Convert MS Outlook HTML content to "pretty" plain HTML

The HTML of an Outlook email is not very pretty, and it is basically useless in my scenario. I'm currently building a support system where users should be able to create new tickets, and comment on existing tickets, via email.

It is 100% certain that only Outlook will be used.

Here's my approach so far:

Subscription and reading the email

private static void OnEvent(object sender, NotificationEventArgs args)
{
    // Streaming subscription to EWS
    var subscription = args.Subscription;

    // Loop through notifications
    foreach(var notification in args.Events)
    {
        // If this is a new mail
        if(notification.EventType == EventType.NewMail)
        {
            var item = (ItemEvent)notification;

            // Define what properties to load
            var props = new PropertySet(BasePropertySet.IdOnly,
                EmailMessageSchema.UniqueBody,
                EmailMessageSchema.From,
                EmailMessageSchema.Subject
                /* , etc. */);

            // We need the body to be in HTML
            props.RequestedBodyType = BodyType.HTML;

            // Bind the message
            var message = EmailMessage.Bind(subscription.Service, item.ItemId, props);

            // Handle the message with custom made handler
            Handlers.ReadEmailAndPerformAction(message);
        }
    }
}

Message handler

public static void ReadEmailAndPerformAction(EmailMessage message)
{
    var from = message.From.Address;
    var subject = message.Subject;
    var body = message.UniqueBody.Text;
    // BIND OTHER PROPERTIES

    if(isReply)
        CommentOnTicketFromEmail(/* Needed arguments */);
    else
        CreateNewTicketFromEmail(/* Needed arguments */);
}

PROBLEM
When I receive and read the email content as HTML, it looks quite weird. This is just Outlook in its full glory, annoying any developer passing through, and the HTML is of little use as it stands. I'd like to store plain, basic HTML in my database, but that is not what I'm receiving.

Here's an example of the HTML content from a very basic email:

<html>
    <body>
        <div>
            <div>
                <span lang="da">
                    <div>
                        <div style="margin:0;">
                            <font face="Calibri,sans-serif" size="2">
                                <span style="font-size:11pt;">Test content</span>
                            </font>
                        </div>
                    </div>
                </span>
            </div>
        </div>
    </body>
</html>

For my system, this is just gibberish. I cannot understand why the content isn't delivered as a simple paragraph and so on. Nonetheless, this is how Outlook serves it.

Somehow, anyhow, I'd like to convert this example into a simple HTML string like this:

<p>Test content</p>

The easiest option would be to read the content as plain text, but that would lose lists, images, etc., and I want to keep lists and embedded images.

Upvotes: 2

Views: 728

Answers (1)

Detilium

Reputation: 3034

Using regular expressions, I managed to clean up the Outlook HTML mess into something a bit more readable. The result is still not 100% "plain" HTML (lists, for example, keep some of their Outlook markup), but at least it's better.

C#

public static string PrepareBody(string body)
{
    var stripHead = new Regex(@"<body.*?>|<\/body>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripScript = new Regex(@"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripStyle = new Regex(@"<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripFonts = new Regex(@"\sface=""(.*?)""|\ssize=""(.*?)""", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripInlineFontSize = new Regex(@"font-size:(.*?);", RegexOptions.IgnoreCase | RegexOptions.Multiline);

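    // Split on the <body> tags; index 1 is the content between them.
    // Fall back to the whole input when no <body> tag is present.
    var regBody = stripHead.Split(body);
    var inner = regBody.Length > 1 ? regBody[1] : body;
    var content = "<div>" + inner.Replace("\n", "\n<br />") + "</div>";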
    content = stripScript.Replace(content, "");
    content = stripStyle.Replace(content, "");
    content = stripFonts.Replace(content, "");
    content = stripInlineFontSize.Replace(content, "");
    content = content.Replace("<o:p>", "")
                    .Replace("</o:p>", "")
                    .Replace(" class=\"WordSection1\"", "")
                    .Replace(" class=\"MsoPlainText\"", "")
                    .Replace(" class=\"MsoNormal\"", "")
                    .Replace("mso-fareast-language:DA", "")
                    .Replace("<br>", "<br />");


    return content;
}

Explanation

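  • stripHead: matches the opening and closing <body> tags. Splitting on it and taking the middle piece keeps only the content inside <body>, which also drops everything before it, including <head></head>.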
  • stripScript: removes any <script></script> tags that might exist
  • stripStyle: removes any <style></style> tags that might exist
  • stripFonts: removes the face and size attributes, as found on <font> tags (the <font> tag itself will still exist, because font color is expressed like this: <font color="red">Content</font>, so we cannot remove the <font> tags completely)
  • stripInlineFontSize: removes any font-size CSS property within inline CSS (example: style="font-size:11pt;")
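
To see what the two font-related patterns do in isolation, here is a minimal sketch that applies them to the innermost markup from the question's sample email. The class name FontStripDemo and the method StripFontNoise are made up for illustration; the two patterns are copied from PrepareBody.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontStripDemo
{
    // Same patterns as stripFonts and stripInlineFontSize in PrepareBody.
    private static readonly Regex StripFonts =
        new Regex(@"\sface=""(.*?)""|\ssize=""(.*?)""", RegexOptions.IgnoreCase);

    private static readonly Regex StripInlineFontSize =
        new Regex(@"font-size:(.*?);", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes face/size attributes and inline font-size declarations.
    public static string StripFontNoise(string html)
    {
        var withoutFontAttributes = StripFonts.Replace(html, "");
        return StripInlineFontSize.Replace(withoutFontAttributes, "");
    }
}
```

Passing in <font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Test content</span></font> gives back <font><span style="">Test content</span></font>. The <font> tag survives as described, and an empty style attribute is left behind, which is part of why the result is cleaner but not yet plain.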

Note, though, that this is not a very pretty solution.
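
One possible way to get closer to plain HTML is to flip the approach: instead of removing known Outlook noise, keep only tags from an allow-list and drop every other tag while leaving its text in place. The following is only a rough sketch of that idea, not something I have run against real Outlook mails. The class AllowListDemo, the method KeepAllowedTags, and the contents of the allow-list are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AllowListDemo
{
    // Hypothetical allow-list; adjust it to whatever the ticket system should keep.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "br", "ul", "ol", "li", "b", "i", "strong", "em", "a", "img" };

    // Matches any opening or closing tag and captures its name (including names like o:p).
    private static readonly Regex AnyTag =
        new Regex(@"</?([a-z][a-z0-9:]*)\b[^>]*>", RegexOptions.IgnoreCase);

    // Keeps allowed tags untouched and removes all other tags, leaving their inner text.
    public static string KeepAllowedTags(string html)
    {
        return AnyTag.Replace(html, m =>
            Allowed.Contains(m.Groups[1].Value) ? m.Value : "");
    }
}
```

With this sketch, the wrapper <div>, <font> and <span> tags around "Test content" disappear, while a <ul><li>…</li></ul> list passes through. It does not wrap loose text in <p>, it leaves the attributes of allowed tags as they are, and like any regex-based approach it will trip over a ">" inside a quoted attribute value, so a real HTML parser would be the more robust choice.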

Upvotes: 1
