Sean W.

Reputation: 5142

How can I extract HTML embedded in RTF using Python?

I'm trying to extract the HTML email bodies from Outlook msg files. I've successfully converted them to eml/standard RFC 822 files using email-outlook-message-perl, but the bodies of the emails are HTML wrapped in RTF. Here's an example snippet:

{\*\htmltag96 <div class="EduText" style="padding:2px;border-width:1px;background-color:#DEE5ED;border-color:##FAFAFA;border-style:solid;">}\htmlrtf {\htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf{\f4\fs24\htmlrtf0 \'cd\'d5\'e0\'c1\'c5\'b9\'d5\'e9\'ca\'e8\'a7\'e4\'bb\'b7\'d5\'e8 john.smith\htmlrtf\f0}\htmlrtf0 
{\*\htmltag116 <br>}\htmlrtf \line
\htmlrtf0 

Is there a way to get the HTML content without all of the RTF crud?

Upvotes: 1

Views: 1452

Answers (2)

Suresh

Reputation: 475

This thread is a few years old, but this might help someone who is new to TNEF and in a similar situation...

If you are on Linux, you can extract the HTML content from the RTF file with the command-line tool unrtf:

unrtf message.rtf

This prints the HTML content to standard output.

If you want to redirect it into a file, you could try unrtf message.rtf > message.html
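Since the question asks about Python, the same command can be driven from a script. This is only a minimal sketch: it assumes unrtf is installed and on PATH, and the helper names unrtf_command and rtf_to_html are made up for illustration. The --html flag just makes the HTML output format explicit.

```python
import shutil
import subprocess
from pathlib import Path


def unrtf_command(rtf_path):
    """Build the argument list for unrtf; --html asks for HTML output."""
    return ["unrtf", "--html", rtf_path]


def rtf_to_html(rtf_path, html_path):
    """Run unrtf on rtf_path and save whatever it prints to html_path."""
    if shutil.which("unrtf") is None:
        raise RuntimeError("unrtf is not installed or not on PATH")
    # check=True raises CalledProcessError if unrtf exits with an error.
    result = subprocess.run(
        unrtf_command(rtf_path), capture_output=True, check=True
    )
    Path(html_path).write_bytes(result.stdout)


# Example call (same file names as the shell commands above):
# rtf_to_html("message.rtf", "message.html")
```

Writing the captured bytes directly avoids guessing the output encoding on the Python side.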

Hope this helps...

-Suresh

Upvotes: 1

BastianW

Reputation: 2658

Microsoft uses TNEF (Transport Neutral Encapsulation Format), so I think you need to search for a TNEF Python implementation like:
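Separately from any TNEF library, once you already have the RTF body, the \htmltag / \htmlrtf markup shown in the question can be unwrapped in plain Python. What follows is a rough sketch, not a complete RTF parser. It assumes that the original HTML sits inside {\*\htmltagN ...} groups, that everything between \htmlrtf and \htmlrtf0 is RTF-only and can be dropped, and that one code page applies to all \'hh escapes (the encoding parameter, which you have to supply yourself). It does not handle \uN Unicode escapes or per-font charsets, and it treats \htmlrtf as a simple on/off switch. The function name deencapsulate_html is made up for this example.

```python
import re

# One token is: a \'hh hex escape, a control word (optional numeric argument,
# optional single-space delimiter), a control symbol, a brace, or plain text.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"
    r"|\\([a-zA-Z]+)(-?\d+)? ?"
    r"|\\([^a-zA-Z])"
    r"|([{}])"
    r"|([^\\{}]+)",
    re.DOTALL,
)

# Destinations that only hold RTF bookkeeping, never HTML.
SKIPPED = {"fonttbl", "colortbl", "stylesheet", "info"}
# Control words that stand for literal characters.
LITERALS = {"par": "\r\n", "line": "\r\n", "tab": "\t"}


def deencapsulate_html(rtf, encoding="cp1252"):
    """Return the HTML wrapped in RTF text (sketch; see assumptions above)."""
    out = []               # recovered pieces of HTML
    pending = bytearray()  # bytes collected from consecutive \'hh escapes
    suppressed = False     # True between \htmlrtf and \htmlrtf0
    skip_depth = None      # depth at which a skipped destination was opened
    depth = 0
    star = False           # True right after a \* marker

    for m in TOKEN.finditer(rtf):
        hexbyte, word, arg, symbol, brace, text = m.groups()
        if hexbyte is not None:
            if skip_depth is None and not suppressed:
                pending.append(int(hexbyte, 16))
            continue
        if pending:  # a run of hex escapes just ended, decode it as one unit
            out.append(pending.decode(encoding, errors="replace"))
            pending.clear()
        if brace == "{":
            depth += 1
        elif brace == "}":
            if skip_depth == depth:
                skip_depth = None
            depth -= 1
        elif word == "htmlrtf":
            suppressed = arg != "0"
        elif skip_depth is not None:
            pass  # inside a destination we are ignoring
        elif symbol == "*":
            star = True
        elif word is not None:
            if word in SKIPPED or (star and word != "htmltag"):
                skip_depth = depth  # ignore this whole group
            elif not suppressed and word in LITERALS:
                out.append(LITERALS[word])
            star = False
        elif suppressed:
            pass  # RTF-only rendering of content that also exists as HTML
        elif symbol is not None:
            if symbol in "{}\\":
                out.append(symbol)  # escaped brace or backslash
        elif text is not None:
            # Raw line breaks in RTF source are not part of the content.
            out.append(text.replace("\r", "").replace("\n", ""))

    if pending:
        out.append(pending.decode(encoding, errors="replace"))
    return "".join(out)


# Shortened version of the snippet from the question; the \'hh bytes are Thai,
# so cp874 is assumed here.
sample = (
    r'{\*\htmltag96 <div class="EduText">}\htmlrtf {\htmlrtf0 '
    r"{\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf{\f4\fs24\htmlrtf0 "
    r"\'cd\'d5\'e0\'c1\'c5 john.smith\htmlrtf\f0}\htmlrtf0 "
    "\n"
    r"{\*\htmltag116 <br>}\htmlrtf \line"
    "\n"
    r"\htmlrtf0 "
)
html = deencapsulate_html(sample, encoding="cp874")
```

For the sample, html ends up as the div tag, the decoded Thai text followed by " john.smith", and the br tag, with the font switches and \line dropped. Real messages usually declare their code page in the RTF header (\ansicpg), so that is a reasonable place to look for the value to pass as encoding.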

Upvotes: 0
