WesternGun

Reputation: 12767

Extract only text part of email body using javamail, without html content

In my project I need to read mail from an MS Exchange mailbox using JavaMail and save each message's content to disk. But even the simplest email I receive is saved with HTML content (head, body, and so on), even when I write only two words with formatting, with no images and no attachments. I just want the text of the email.

Part of code:

Object content = part.getContent();
if (content instanceof InputStream || content instanceof String) {
    if (Part.ATTACHMENT.equalsIgnoreCase(part.getDisposition()) ||
            StringUtils.isNotBlank(part.getFileName())) {
        String messageBody = part.getContent().toString();
        // ... (write this string to files)
    }
}

I may write:

Hello world.

And I get a .txt file containing all the HTML markup: font faces, tags like <html>, and so on.

I saw this question, where the asker retrieves only the text content. I cannot comment there, so I have to post a new question. I see no difference between my code and his. He wrote:

if (disposition != null && (disposition.equals(BodyPart.ATTACHMENT))) {

    DataHandler handler = bodyPart.getDataHandler();

    s1 = (String) bodyPart.getContent();

So is it about the DataHandler? It does not seem to be used anywhere, though. Can someone help?

Upvotes: 2

Views: 5364

Answers (1)

Bill Shannon

Reputation: 29971

First of all, you'll want to read this JavaMail FAQ entry that tells you how to find the main message body. As written, it prefers an html body over a plain text body in cases where the message contains both. It should be clear how to reverse that preference.
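
As a rough illustration of reversing that preference, here is a minimal sketch modeled on the FAQ's approach rather than copied from it. The class name `PlainTextFirst` is hypothetical, and the code assumes the classic `javax.mail` package on the classpath (newer releases use `jakarta.mail` instead). Inside a `multipart/alternative` it returns the `text/plain` part as soon as it finds one and only falls back to another alternative when no plain text exists.

```java
import java.io.IOException;
import javax.mail.MessagingException;
import javax.mail.Multipart;
import javax.mail.Part;

// Hypothetical helper; a sketch, not the FAQ's exact code.
final class PlainTextFirst {

    // Returns the main body text, preferring text/plain over text/html.
    // Returns null if no text part is found.
    static String getText(Part p) throws MessagingException, IOException {
        if (p.isMimeType("text/*")) {
            // For text parts JavaMail normally hands back a String.
            return (String) p.getContent();
        }
        if (p.isMimeType("multipart/alternative")) {
            Multipart mp = (Multipart) p.getContent();
            String fallback = null;
            for (int i = 0; i < mp.getCount(); i++) {
                Part bp = mp.getBodyPart(i);
                if (bp.isMimeType("text/plain")) {
                    String s = getText(bp);
                    if (s != null) {
                        return s; // plain text wins immediately
                    }
                } else if (fallback == null) {
                    // Remember the first other alternative (often HTML).
                    fallback = getText(bp);
                }
            }
            return fallback;
        }
        if (p.isMimeType("multipart/*")) {
            // Other multiparts: take the first part that yields text.
            Multipart mp = (Multipart) p.getContent();
            for (int i = 0; i < mp.getCount(); i++) {
                String s = getText(mp.getBodyPart(i));
                if (s != null) {
                    return s;
                }
            }
        }
        return null;
    }
}
```

Calling `PlainTextFirst.getText(message)` on a message that carries both versions should then give the plain text; when the message carries only HTML, the returned string is still HTML markup and needs further processing.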

But, not all messages will contain both html and plain text versions of the message body. If you get only html, you're going to have to write your own code to process the string and remove the html tags, or use some other product to process the html and remove the tags.
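
For the HTML-only case, a crude way to "write your own code" is a handful of regular expressions. The sketch below is an assumption-laden illustration, not a robust converter: the class name `HtmlToText` and its method are hypothetical, it only decodes a few common entities, and it replaces every tag with a space, so a word split by inline markup (such as `<b>Hel</b>lo`) comes out with a gap. A real HTML parser such as jsoup would be a sturdier choice for anything beyond simple messages.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a deliberately simple regex-based tag stripper.
final class HtmlToText {

    // Whole <script>...</script> and <style>...</style> blocks, any case.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // HTML comments, which Exchange/Outlook mail often contains.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

With this sketch, `HtmlToText.stripTags("<html><body><p><b>Hello</b> world.</p></body></html>")` returns `Hello world.`.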

Upvotes: 1
