Using Java, how do I strip the HTML from a POP3 email when reading it with JavaMail?

Question

I need to let users submit an email to an address, and the email will be used to populate entities in my database. My program periodically checks the inbox for new mail, and when it finds a new message I need to use the subject, from, sent date, attachments and body to populate DB entities. I have been able to get all of the fields, but I am having trouble with the body when it contains HTML. I just need to store the text of the email, so I would like to strip out all tags, signatures, etc. from the body. Is there a better way to do this than regex?

Here is the function I am using to get the body text. My problem occurs when the MIME type hits the "multipart/*" case in the last part of the function: the function returns the HTML message. What can I do to strip the tags in that section, other than regex?

/**
 * Return the primary text content of the message.
 */
private String getText(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/*")) {
        String s = (String)p.getContent();
        textIsHtml = p.isMimeType("text/html");
        return s;
    }

    if (p.isMimeType("multipart/alternative")) {
        // prefer html text over plain text
        Multipart mp = (Multipart)p.getContent();
        String text = null;
        for (int i = 0; i < mp.getCount(); i++) {
            Part bp = mp.getBodyPart(i);
            if (bp.isMimeType("text/plain")) {
                if (text == null){
                    text = getText(bp);
                }
                continue;
            } 
            else if (bp.isMimeType("text/html")) {
                String s = getText(bp);
                if (s != null){
                    return s;
                }
            } 
            else {
                return getText(bp);
            }
        }
        return text;
    } 
    else if (p.isMimeType("multipart/*")) {
        Multipart mp = (Multipart)p.getContent();
        for (int i = 0; i < mp.getCount(); i++) {
            String s = getText(mp.getBodyPart(i));
            if (s != null)
                return s;
        }
    }
    return null;
}

Any and all help is much appreciated.

I've been trying the following, but it results in the Spanish á problem I commented about below.

    else if (p.isMimeType("multipart/*")) {
        Multipart mp = (Multipart)p.getContent();
        for (int i = 0; i < mp.getCount(); i++) {
            String s = getText(mp.getBodyPart(i));
            Document doc = Jsoup.parse(s);
            String retText = doc.text();
            retText.replaceAll("[0%d0%a]", "\n");
            if (retText != null)
                return retText;
        }
    }

I have also tried [ ] and [ ] as my regex.

davidbuzatto · Accepted Answer

You can use an HTML parser such as jsoup to parse the HTML and extract just the text you want, instead of stripping tags with regex.
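A minimal sketch of the jsoup approach, assuming jsoup is on the classpath. The class name `HtmlToText` and the method `toPlainText` are made up for illustration and are not part of jsoup or JavaMail. The non-breaking-space replacement is only a guess at one possible source of stray characters after stripping: depending on the jsoup version, `&nbsp;` may survive in the output as U+00A0.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlToText {

    /**
     * Converts an HTML document or fragment to plain text.
     * Returns null for null input so existing null checks keep working.
     * (Hypothetical helper, named here for illustration only.)
     */
    public static String toPlainText(String html) {
        if (html == null) {
            return null;
        }
        // Parse the markup; jsoup tolerates fragments and malformed HTML.
        Document doc = Jsoup.parse(html);
        // text() returns the combined text of all elements, with
        // whitespace normalized and the tags dropped.
        String text = doc.text();
        // Assumption: map any remaining non-breaking spaces (U+00A0)
        // to ordinary spaces before storing the text.
        return text.replace('\u00a0', ' ').trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(toPlainText(html)); // prints "Hello world"
    }
}
```

In the `getText` method from the question, one option would be to call this helper only where the content is known to be HTML, for example in the `text/*` branch when `textIsHtml` is true, so plain-text parts are returned unchanged. Note that this only removes markup; stripping signatures would need separate handling.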
