
Reputation: 13984

How to read or parse MHTML (.mht) files in java

I need to mine the content of most well-known document file formats, such as:

  1. pdf
  2. html
  3. doc/docx etc.

For most of these file formats I am planning to use Apache Tika.

But as of now Tika does not support MHTML (*.mht) files ( http://en.wikipedia.org/wiki/MHTML ). There are a few examples in C# ( http://www.codeproject.com/KB/files/MhtBuilder.aspx ), but I found none in Java.

I tried opening the *.mht file in 7-Zip and it failed, although WinZip was able to decompress the file into images and text (CSS, HTML, script) as text and binary files.

As per the MSDN page ( http://msdn.microsoft.com/en-us/library/aa767785%28VS.85%29.aspx#compress_content ) and the Code Project page I mentioned earlier, mht files use GZip compression.

Attempting to decompress the file in Java results in the following exceptions. With java.util.zip.GZIPInputStream:

java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at java.util.zip.GZIPInputStream.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:16)

And with java.util.zip.ZipFile

 java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(Unknown Source)
at java.util.zip.ZipFile.<init>(Unknown Source)
at GZipTest.main(GZipTest.java:21)

Kindly suggest how to decompress it....


Upvotes: 13

Views: 28062

Answers (6)


Reputation: 41

A more compact version using the JavaMail APIs:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.net.URL;
import java.util.Properties;

import javax.mail.BodyPart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import javax.mail.internet.MimeMultipart;

import org.apache.commons.io.IOUtils;

public class MhtParser {

    private File mhtFile;
    private File outputFolder;

    public MhtParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    public void decompress() throws Exception {
        // An MHT file is a MIME message, so JavaMail can parse it directly.
        try (FileInputStream in = new FileInputStream(mhtFile)) {
            MimeMessage message =
                new MimeMessage(
                        Session.getDefaultInstance(new Properties(), null),
                        in);

            if (message.getContent() instanceof MimeMultipart) {
                MimeMultipart mimeMultipart = (MimeMultipart) message.getContent();

                for (int i = 0; i < mimeMultipart.getCount(); i++) {
                    BodyPart bodyPart = mimeMultipart.getBodyPart(i);
                    String fileName = bodyPart.getFileName();

                    // Fall back to the last path segment of Content-Location.
                    if (fileName == null) {
                        String[] locationHeader = bodyPart.getHeader("Content-Location");
                        if (locationHeader != null && locationHeader.length > 0) {
                            fileName =
                                new File(new URL(locationHeader[0]).getFile()).getName();
                        }
                    }

                    if (fileName != null) {
                        // getInputStream() returns the decoded content of the part.
                        try (FileOutputStream out =
                                new FileOutputStream(new File(outputFolder, fileName))) {
                            IOUtils.copy(bodyPart.getInputStream(), out);
                        }
                    }
                }
            }
        }
    }
}

Upvotes: 3

David Turner

Reputation: 316

Late to the party, but expanding on @wener's answer for anyone else stumbling across this.

The Apache Mime4J library seems to have the most readily accessible solution for EML or MHTML processing, much easier than rolling-your-own!

My prototype 'parseMhtToFile' function below rips html files and other artifacts out of a Cognos active report 'mht' file, but could be tailored to other purposes.

This is written in Groovy and requires Apache Mime4J 'core' and 'dom' jars (currently 0.7.2).

import org.apache.james.mime4j.dom.Message
import org.apache.james.mime4j.dom.Multipart
import org.apache.james.mime4j.dom.field.ContentTypeField
import org.apache.james.mime4j.message.DefaultMessageBuilder
import org.apache.james.mime4j.stream.MimeConfig

/**
 * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
 * separate html files.
 * Files will be written to outDir (or parent) as baseName + partIdx + ext.
 */
void parseMhtToFile(File mhtFile, File outDir = null) {
    if (!outDir) { outDir = mhtFile.parentFile }
    // File baseName will be used in generating new filenames
    def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^\.]+$/, '')

    // -- Set up Mime parser, using Default Message Builder
    MimeConfig parserConfig  = new MimeConfig();
    parserConfig.setMaxHeaderLen(-1); // The default is a mere 10k
    parserConfig.setMaxLineLen(-1); // The default is only 1000 characters.
    parserConfig.setMaxHeaderCount(-1); // Disable the check for header count.
    DefaultMessageBuilder builder = new DefaultMessageBuilder();
    builder.setMimeEntityConfig(parserConfig); // Apply the relaxed limits above

    // -- Parse the MHT stream data into a Message object
    println "Parsing ${mhtFile}...";
    InputStream mhtStream = mhtFile.newInputStream()
    Message message = builder.parseMessage(mhtStream);

    // -- Process the resulting body parts, writing to file
    assert message.getBody() instanceof Multipart
    Multipart multipart = (Multipart) message.getBody();
    def parts = multipart.getBodyParts();
    parts.eachWithIndex { p, i ->
        ContentTypeField cType = p.header.getField('content-type')
        println "${p.class.simpleName}\t${i}\t${cType.mimeType}"

        // Assume mime sub-type is a "good enough" file-name extension 
        // e.g. text/html = html, image/png = png, application/json = json
        String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
        File partFile = new File(outDir, partFileName)

        // Write part body stream to file
        println "Writing ${partFile}...";
        if (partFile.exists()) partFile.delete();
        InputStream partStream = p.body.inputStream;
        partFile.append(partStream);
    }
}

Usage is simply:

File mhtFile = new File('<path>', 'Report-en-au.mht')
parseMhtToFile(mhtFile)
println 'Done.'

Output is:

Parsing <path>\Report-en-au.mht...
BodyPart    0   text/html
Writing <path>\Report-en-au_0.html...
BodyPart    1   image/png
Writing <path>\Report-en-au_1.png...

Thoughts on other improvements:

  • For 'text' mime parts, you can access a Reader instead of a Stream which might be more appropriate for text mining as the OP requested.

  • For generated filename extensions, I'd use another library to lookup appropriate extension, not assume the mime sub-type is adequate.

  • Handle Single-body (non-Multipart) and Recursive Multipart mhtml files and other complexities. These may require a MimeStreamParser with custom Content Handler implementation.
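
On the filename-extension point in the list above, a small lookup table may be enough when the set of part types is known in advance. The following is only a sketch, written in Java rather than Groovy: the class `MimeExtensions`, its `extensionFor` method, and the particular mappings are illustrative assumptions and not part of Mime4J. Anything unlisted falls back to the MIME sub-type, which is what the function above already does. It uses `Map.of`, so it assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a MIME type to a file extension. */
class MimeExtensions {

    // Illustrative mappings only; extend for the part types you actually see.
    private static final Map<String, String> KNOWN = Map.of(
            "text/html", "html",
            "text/css", "css",
            "text/plain", "txt",
            "image/jpeg", "jpg",
            "image/png", "png",
            "image/gif", "gif",
            "image/svg+xml", "svg",
            "application/javascript", "js",
            "application/json", "json");

    /** Returns an extension for the given MIME type, without a leading dot. */
    static String extensionFor(String mimeType) {
        // Drop parameters such as "; charset=utf-8" and normalise case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        String known = KNOWN.get(base);
        if (known != null) {
            return known;
        }
        // Fall back to the sub-type, or "bin" if there is no usable sub-type.
        int slash = base.indexOf('/');
        boolean hasSubType = slash >= 0 && slash < base.length() - 1;
        return hasSubType ? base.substring(slash + 1) : "bin";
    }
}
```

In the Groovy function this would replace `cType.subType` with something like `MimeExtensions.extensionFor(cType.mimeType)`.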

Upvotes: 1


Reputation: 7760

You don't have to do it on your own.

With dependency


Run it on your mht file:

public static void main(String[] args) {
    MessageTree.main(new String[]{"YOUR MHT FILE PATH"});
}

MessageTree will

/**
 * Displays a parsed Message in a window. The window will be divided into
 * two panels. The left panel displays the Message tree. Clicking on a
 * node in the tree shows information on that node in the right panel.
 * Some of this code have been copied from the Java tutorial's JTree section.
 */

Then you can look into it.


Upvotes: 2


Reputation: 13984

Frankly, I wasn't expecting a solution in the near future and was about to give up, but somehow I stumbled on this page:



It does not look very catchy at first, but if you read it carefully you will get the clue. After reading it I fired up IE and started saving random pages as *.mht files. Let me go through one line by line...

But let me explain beforehand that my ultimate goal was to separate/extract out the HTML content and parse it. The solution is not complete in itself, as it depends on the character set or encoding chosen while saving, but it will still extract the individual files with minor hitches.

I hope this will be useful for anyone who is trying to parse/decompress *.mht/MHTML files :)

======= Explanation ======== ** Taken from a mht file **

From: "Saved by Windows Internet Explorer 7"

It is the software used for saving the file

Subject: Google
Date: Tue, 13 Jul 2010 21:23:03 +0530
MIME-Version: 1.0

Subject, date and mime-version … much like the mail format

  Content-Type: multipart/related;

This is the part which tells us that it is a multipart document. A multipart document has one or more different sets of data combined in a single body, and a multipart Content-Type field must appear in the entity's header. Here, we can also see the type as "text/html".


Of all the header fields, the boundary is the most important. It is the unique delimiter that separates the different parts (HTML, images, CSS, script, etc.). Once you get hold of it, everything gets easy: I just have to iterate through the document, find the different sections, and save each one according to its Content-Transfer-Encoding (base64, quoted-printable, etc.).
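
The boundary handling described above can be sketched on its own before looking at the full parser. This is a minimal illustration, not the parser itself: the class `BoundarySplitter` and its two methods are hypothetical names, and it assumes the whole document fits in memory as a string and that each delimiter sits on its own line as `--` followed by the boundary (with a trailing `--` on the closing one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds the boundary and splits a multipart document on it. */
class BoundarySplitter {

    // Matches boundary="..." (quoted) or boundary=token (unquoted), case-insensitively.
    private static final Pattern BOUNDARY = Pattern.compile(
            "boundary\\s*=\\s*(?:\"([^\"]+)\"|([^\\s;]+))", Pattern.CASE_INSENSITIVE);

    /** Returns the first boundary declared in the document, or null if none is found. */
    static String findBoundary(String document) {
        Matcher m = BOUNDARY.matcher(document);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    /** Returns each part (its headers plus its still-encoded body) between delimiter lines. */
    static List<String> splitParts(String document, String boundary) {
        String delimiter = "--" + boundary;
        String closing = delimiter + "--";
        List<String> parts = new ArrayList<>();
        StringBuilder current = null; // stays null until the first delimiter is seen

        for (String line : document.split("\r?\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.equals(closing)) {
                // Closing delimiter: flush the last part and ignore any epilogue.
                if (current != null) {
                    parts.add(current.toString());
                }
                current = null;
                break;
            }
            if (trimmed.equals(delimiter)) {
                // Ordinary delimiter: flush the previous part and start a new one.
                if (current != null) {
                    parts.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        // Be lenient with truncated files that lack a closing delimiter.
        if (current != null) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

Each returned part still starts with its own headers (Content-Type, Content-Transfer-Encoding, Content-Location), followed by a blank line and the encoded body.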


 Content-Type: text/html;
 Content-Transfer-Encoding: quoted-printable
 Content-Location: http://www.google.com/webhp?sourceid=navclient&ie=UTF-8

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" =

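The trailing `=` on the DOCTYPE line above is a quoted-printable soft line break, and sequences such as `=3D` or `=EF=BB=BF` stand for raw bytes. As a rough sketch of what decoding this transfer encoding involves (following RFC 2045, and independent of the parser class further down), the hypothetical `QuotedPrintable` class below removes soft line breaks and turns each `=XX` escape back into a byte. It assumes the encoded input is plain ASCII, and it keeps malformed escapes as they are.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

/** Hypothetical helper: decodes quoted-printable text as described in RFC 2045. */
class QuotedPrintable {

    /** Decodes the encoded text to raw bytes. */
    static byte[] decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = encoded.length();
        int i = 0;
        while (i < n) {
            char c = encoded.charAt(i);
            if (c != '=') {
                out.write(c); // literal character (input is assumed to be ASCII)
                i++;
            } else if (i + 1 < n && encoded.charAt(i + 1) == '\n') {
                i += 2; // soft line break "=\n": drop it
            } else if (i + 2 < n && encoded.charAt(i + 1) == '\r'
                    && encoded.charAt(i + 2) == '\n') {
                i += 3; // soft line break "=\r\n": drop it
            } else if (i + 2 < n && isHex(encoded.charAt(i + 1))
                    && isHex(encoded.charAt(i + 2))) {
                // "=XX" escape: two hex digits become one byte
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write(c); // malformed escape: keep the '=' unchanged
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Decodes the encoded text and interprets the bytes in the given charset. */
    static String decodeToString(String encoded, Charset charset) {
        return new String(decode(encoded), charset);
    }

    private static boolean isHex(char ch) {
        return Character.digit(ch, 16) >= 0;
    }
}
```

The charset passed to `decodeToString` should be the one declared in the part's Content-Type header, which is why the result depends on the encoding chosen when the page was saved.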

An interface for defining constants.

public interface IConstants
{
    public String BOUNDARY = "boundary";
    public String CHAR_SET = "charset";
    public String CONTENT_TYPE = "Content-Type";
    public String CONTENT_TRANSFER_ENCODING = "Content-Transfer-Encoding";
    public String CONTENT_LOCATION = "Content-Location";

    public String UTF8_BOM = "=EF=BB=BF";

    public String UTF16_BOM1 = "=FF=FE";
    public String UTF16_BOM2 = "=FE=FF";
}

The main parser class...

/**
 * This program and the accompanying materials are made available under the terms of the Eclipse Public License v1.0
 * which accompanies this distribution, and is available at
 * http://www.eclipse.org/legal/epl-v10.html
 */
package com.test.mht.core;

import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Parses a *.mht file and decomposes it into its constituent parts.
 * @author Manish Shukla 
 */
public class MHTParser implements IConstants
{
    private File mhtFile;
    private File outputFolder;

    public MHTParser(File mhtFile, File outputFolder) {
        this.mhtFile = mhtFile;
        this.outputFolder = outputFolder;
    }

    /**
     * Splits the file on its boundary and writes every part to the output folder.
     * @throws Exception
     */
    public void decompress() throws Exception
    {
        BufferedReader reader = null;

        String type = "";
        String encoding = "";
        String location = "";
        String filename = "";
        String charset = "utf-8";
        StringBuilder buffer = null;

        try
        {
            reader = new BufferedReader(new FileReader(mhtFile));

            final String boundary = getBoundary(reader);
            if(boundary == null)
                throw new Exception("Failed to find document 'boundary'... Aborting");

            String line = null;
            while((line = reader.readLine()) != null)
            {
                String temp = line.trim();
                if(temp.contains(boundary)) {
                    // A boundary line ends the current part (if any) and starts the next one
                    if(buffer != null) {
                        writeBufferContentToFile(buffer, encoding, filename, charset);
                        buffer = null;
                    }

                    buffer = new StringBuilder();
                }else if(temp.startsWith(CONTENT_TYPE)) {
                    type = getType(temp);
                }else if(temp.startsWith(CHAR_SET)) {
                    charset = getCharSet(temp);
                }else if(temp.startsWith(CONTENT_TRANSFER_ENCODING)) {
                    encoding = getEncoding(temp);
                }else if(temp.startsWith(CONTENT_LOCATION)) {
                    location = temp.substring(temp.indexOf(":")+1).trim();
                    filename = getFileName(location,type);
                }else {
                    if(buffer != null) {
                        buffer.append(line + "\n");
                    }
                }
            }
        }finally {
            if(null != reader)
                reader.close();
        }
    }

    private String getCharSet(String temp) 
    {
        // Expects a line such as: charset="utf-8"
        String t = temp.split("=")[1].trim();
        return t.substring(1, t.length()-1);
    }

    /**
     * Save the file as per character set and encoding 
     */
    private void writeBufferContentToFile(StringBuilder buffer,String encoding, String filename, String charset) 
    throws Exception
    {
        if(!outputFolder.exists())
            outputFolder.mkdirs();

        byte[] content = null; 

        boolean text = true;

        if(encoding.equalsIgnoreCase("base64")) {
            content = getBase64EncodedString(buffer);
            text = false;
        }else if(encoding.equalsIgnoreCase("quoted-printable")) {
            content = getQuotedPrintableString(buffer);         
        }else {
            content = buffer.toString().getBytes();
        }

        if(!text) {
            // Binary parts (images etc.) are written byte for byte
            try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filename))) {
                bos.write(content);
            }
        }else {
            // Text parts are written using the declared character set
            try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filename), charset))) {
                bw.write(new String(content));
            }
        }
    }

    /**
     * When the *.mht file is saved with 'utf-8' encoding, '=EF=BB=BF' is prepended.</br>
     * Note: this only strips that BOM and the soft line breaks; other '=XX' sequences are left undecoded.
     * @see http://en.wikipedia.org/wiki/Byte_order_mark
     */
    private byte[] getQuotedPrintableString(StringBuilder buffer) 
    {
        //Set<String> uniqueHex = new HashSet<String>();
        //final Pattern p = Pattern.compile("(=\\p{XDigit}{2})*");

        String temp = buffer.toString().replaceAll(UTF8_BOM, "").replaceAll("=\n", "");

        //Matcher m = p.matcher(temp);
        //while(m.find()) {
        //  uniqueHex.add(m.group());
        //}

        //for (String hex : uniqueHex) {
            //temp = temp.replaceAll(hex, getASCIIValue(hex.substring(1)));
        //}

        return temp.getBytes();
    }

    /*private String getASCIIValue(String hex) {
        return ""+(char)Integer.parseInt(hex, 16);
    }*/

    /**
     * Decodes a base64 part. The original used the JDK-internal sun.misc.BASE64Decoder,
     * which is no longer available on current JDKs; java.util.Base64 (Java 8+) is used instead.
     * The MIME decoder ignores the line breaks inside the buffer.
     */
    private byte[] getBase64EncodedString(StringBuilder buffer) throws Exception {
        return Base64.getMimeDecoder().decode(buffer.toString());
    }

    /**
     * Tries to get a qualified file name. If the name is not apparent it tries to guess it from the URL.
     * Otherwise it returns 'unknown.<type>'
     */
    private String getFileName(String location, String type) 
    {
        final Pattern p = Pattern.compile("(\\w|_|-)+\\.\\w+");
        String ext = "";
        String name = "";
        if(type.toLowerCase().endsWith("jpeg"))
            ext = "jpg";
        else
            ext = type.split("/")[1];

        if(location.endsWith("/")) {
            name = "main";
        }else {
            name = location.substring(location.lastIndexOf("/") + 1);

            // Keep the last "name.ext" looking token in the final URL segment
            Matcher m = p.matcher(name);
            String fname = "";
            while(m.find()) {
                fname = m.group();
            }

            if(fname.trim().length() == 0)
                name = "unknown";
            else
                return getUniqueName(fname.substring(0,fname.indexOf(".")), fname.substring(fname.indexOf(".") + 1, fname.length()));
        }
        return getUniqueName(name,ext);
    }

    /**
     * Returns a qualified unique output file path for the parsed path.</br>
     * In case the file already exists it appends a numerical value and continues
     */
    private String getUniqueName(String name,String ext)
    {
        int i = 1;
        File file = new File(outputFolder,name + "." + ext);
        if(file.exists()) {
            while(true) {
                file = new File(outputFolder, name + i + "." + ext);
                if(!file.exists())
                    return file.getAbsolutePath();
                i++;
            }
        }

        return file.getAbsolutePath();
    }

    private String getType(String line) {
        return splitUsingColonSpace(line);
    }

    private String getEncoding(String line){
        return splitUsingColonSpace(line);
    }

    private String splitUsingColonSpace(String line) {
        return line.split(":\\s*")[1].replaceAll(";", "");
    }

    /**
     * Gives you the boundary string
     */
    private String getBoundary(BufferedReader reader) throws Exception 
    {
        String line = null;

        while((line = reader.readLine()) != null)
        {
            line = line.trim();
            if(line.startsWith(BOUNDARY)) {
                return line.substring(line.indexOf("\"") + 1, line.lastIndexOf("\""));
            }
        }

        return null;
    }
}

Upvotes: 15

Wajdy Essam
Wajdy Essam

Reputation: 4340

I used JTidy ( http://jtidy.sourceforge.net ) to parse/read/index mht files (but as normal files, not as compressed files).

Upvotes: 0


Reputation: 2688

You can try http://www.chilkatsoft.com/mht-features.asp ; it can pack/unpack MHT files, and you can then handle the contents as normal files. The download link is: http://www.chilkatsoft.com/java.asp

Upvotes: 0
