Reputation: 3034
The HTML you get when reading an Outlook email is not very pretty, and it is basically "useless" in my scenario. I'm currently building a support system where users should be able to create new tickets and comment on tickets via email.
It is 100% certain that only Outlook will be used.
Here's my approach so far:
Subscription and reading the email
private static void OnEvent(object sender, NotificationEventArgs args)
{
    // Streaming subscription to EWS
    var subscription = args.Subscription;
    // Loop through notifications
    foreach (var notification in args.Events)
    {
        // If this is a new mail
        if (notification.EventType == EventType.NewMail)
        {
            var item = (ItemEvent)notification;
            // Define what properties to load
            var props = new PropertySet(BasePropertySet.IdOnly,
                EmailMessageSchema.UniqueBody,
                EmailMessageSchema.From,
                EmailMessageSchema.Subject
                /* etc. */);
            // We need the body to be in HTML
            props.RequestedBodyType = BodyType.HTML;
            // Bind the message
            var message = EmailMessage.Bind(subscription.Service, item.ItemId, props);
            // Handle the message with custom made handler
            Handlers.ReadEmailAndPerformAction(message);
        }
    }
}
Message handler
public static void ReadEmailAndPerformAction(EmailMessage message)
{
    var from = message.From.Address;
    var subject = message.Subject;
    var body = message.UniqueBody.Text;
    // BIND OTHER PROPERTIES
    if (isReply)
        CommentOnTicketFromEmail(/* Needed arguments */);
    else
        CreateNewTicketFromEmail(/* Needed arguments */);
}
PROBLEM
When I receive and read the email content in HTML, it looks quite weird. This is just Outlook in its full glory, annoying any developer passing through, and the HTML is somewhat useless. I'd like to read plain, basic HTML and insert it into my database, but that is not what I'm receiving.
Here's an example of the HTML content from a very basic email:
<html>
  <body>
    <div>
      <div>
        <span lang="da">
          <div>
            <div style="margin:0;">
              <font face="Calibri,sans-serif" size="2">
                <span style="font-size:11pt;">Test content</span>
              </font>
            </div>
          </div>
        </span>
      </div>
    </div>
  </body>
</html>
For my system, this is just gibberish. I simply cannot understand why the input is not a paragraph and so on. Nonetheless, this is of course how Outlook decided to serve the content for me.
Somehow, anyhow, I'd like to convert this example into a simple HTML string like this:
<p>Test content</p>
The easiest option for me would be to just read the content as plain text, but that would lose lists, images, etc., and I want to keep lists and embedded images.
Upvotes: 2
Views: 728
Reputation: 3034
Using regular expressions, I managed to clean up the Outlook HTML mess into something a bit more readable and pretty. It's still not 100% "plain" HTML (lists and the like are not normalized), but at least it's better.
C#
public static string PrepareBody(string body)
{
    var stripHead = new Regex(@"<body.*?>|<\/body>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripScript = new Regex(@"<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripStyle = new Regex(@"<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripFonts = new Regex(@"\sface=""(.*?)""|\ssize=""(.*?)""", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var stripInlineFontSize = new Regex(@"font-size:(.*?);", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    var regBody = stripHead.Split(body);
    var content = "<div>" + regBody[1].Replace("\n", "\n<br />") + "</div>";
    content = stripScript.Replace(content, "");
    content = stripStyle.Replace(content, "");
    content = stripFonts.Replace(content, "");
    content = stripInlineFontSize.Replace(content, "");
    content = content.Replace("<o:p>", "")
        .Replace("</o:p>", "")
        .Replace(" class=\"WordSection1\"", "")
        .Replace(" class=\"MsoPlainText\"", "")
        .Replace(" class=\"MsoNormal\"", "")
        .Replace("mso-fareast-language:DA", "")
        .Replace("<br>", "<br />");
    return content;
}
Explanation
stripHead: matches the opening and closing <body> tags. Splitting on it discards everything outside them (including <head></head>) and keeps the content in between.
stripScript: removes any <script></script> blocks that might exist.
stripStyle: removes any <style></style> blocks that might exist.
stripFonts: removes the face and size attributes used on <font> tags. The <font> tag itself will still exist, because font color is displayed like this: <font color="red">Content</font>, so we cannot completely remove the <font> tags.
stripInlineFontSize: removes any font-size CSS property within inline CSS (example: style="font-size:11pt;").

Note, though, that this is not a very pretty solution.
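To make the explanation a bit more concrete, here is a small stand-alone sketch that runs three of the patterns from PrepareBody against a one-line version of the sample email. The class name OutlookHtmlDemo and the helpers ExtractBody and StripFontNoise are made up for this illustration and are not part of the code above. The length check in ExtractBody is also an addition of mine, assuming you would rather get the input back than an exception when no <body> tag is present.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OutlookHtmlDemo
{
    // Same patterns as in PrepareBody.
    static readonly Regex StripHead = new Regex(@"<body.*?>|<\/body>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    static readonly Regex StripFonts = new Regex(@"\sface=""(.*?)""|\ssize=""(.*?)""", RegexOptions.IgnoreCase | RegexOptions.Multiline);
    static readonly Regex StripInlineFontSize = new Regex(@"font-size:(.*?);", RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Hypothetical helper: returns the markup between <body ...> and </body>.
    // Falls back to the unchanged input when there is no <body> tag (my assumption).
    public static string ExtractBody(string html)
    {
        var parts = StripHead.Split(html);
        return parts.Length > 1 ? parts[1] : html;
    }

    // Hypothetical helper: drops face/size attributes and inline font-size declarations.
    public static string StripFontNoise(string html)
    {
        var withoutFontAttributes = StripFonts.Replace(html, "");
        return StripInlineFontSize.Replace(withoutFontAttributes, "");
    }

    public static void Main()
    {
        var html = "<html><head><title>t</title></head><body lang=\"da\">" +
                   "<font face=\"Calibri,sans-serif\" size=\"2\">" +
                   "<span style=\"font-size:11pt;\">Test content</span></font>" +
                   "</body></html>";

        var inner = ExtractBody(html);
        Console.WriteLine(inner);
        // <font face="Calibri,sans-serif" size="2"><span style="font-size:11pt;">Test content</span></font>

        Console.WriteLine(StripFontNoise(inner));
        // <font><span style="">Test content</span></font>
    }
}
```

As the second output shows, the empty style="" attribute and the bare <font> tag are left behind, which is part of why the result is cleaner but still not "plain" HTML.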
Upvotes: 1