Jsoup Bug? Parsing an .mht Document

Question

I am trying to parse an MHT document using Jsoup (version 1.7.3). The goal is to open two files and merge them (joining head and body) into one complete file. But first, I ran into problems parsing the MHT file: the parsed result is missing a significant amount of information and can't be opened after parsing. What I did is the following:

1. Create an MHT file using Word (containing one image and some text)
2. Parse it to a String using Jsoup
3. Write the String to a file
4. Open the written file: it is broken

I used the following code:

private static final String USED_CHARSET = "windows-1252";
private static final String PATH = "C:\\Test\\";
private static final Charset CHARSET = Charset.forName(USED_CHARSET);

@Test
public void test() throws IOException {
    Document doc = Jsoup.parse(new File(PATH, "sourceMht.mht"),
            USED_CHARSET);

    writeDoc(new File(PATH, "parsedMht.mht"), doc.html());
}

private void writeDoc(File file, String html) throws IOException {
    Writer out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(file), CHARSET));
    try {
        out.write(html);
    } finally {
        out.flush();
        out.close();
    }
}

Thanks for your help.

andyroberts · Accepted Answer

It's not a Jsoup bug. The problem is that MHT files are MIME multipart files, bundling HTML and other resources together into a single file. If you open your MHT file in a text editor (e.g. Notepad) you'll see that it's not a pure HTML file but a MIME-encoded file:

MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_01CFB635.40B30630"
....

The various assets (HTML, CSS, images, etc.) each live in their own part of that file. So before you can apply Jsoup to the problem, you first need to parse the MIME multipart file to get at the individual parts.
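To make that concrete, here is a minimal sketch of pulling the HTML part out by hand using only the JDK. The class and method names (MhtHtmlExtractor, extractHtml, decodeQuotedPrintable) are made up for illustration, and it assumes a single-level multipart file whose HTML part is either quoted-printable or unencoded. It does not handle base64 parts or nested multiparts, so a proper MIME library is likely the better choice for real files.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or any MIME library.
final class MhtHtmlExtractor {

    /** Returns the decoded body of the first text/html part, or null if none is found. */
    static String extractHtml(String mht, Charset charset) {
        // The boundary is declared in the top-level Content-Type header.
        Matcher boundary = Pattern.compile("boundary=\"?([^\"\\r\\n;]+)\"?").matcher(mht);
        if (!boundary.find()) {
            return null;
        }
        // Inside the file, every part is introduced by "--" + boundary.
        String delimiter = "--" + boundary.group(1);
        Pattern blankLine = Pattern.compile("\\r?\\n\\r?\\n");
        Pattern htmlType = Pattern.compile("content-type:\\s*text/html");

        for (String part : mht.split(Pattern.quote(delimiter))) {
            // The first blank line separates a part's headers from its body.
            Matcher split = blankLine.matcher(part);
            if (!split.find()) {
                continue;
            }
            String headers = part.substring(0, split.start()).toLowerCase();
            String body = part.substring(split.end());
            if (!htmlType.matcher(headers).find()) {
                continue;
            }
            // Assumption: anything not marked quoted-printable is returned as-is.
            return headers.contains("quoted-printable")
                    ? decodeQuotedPrintable(body, charset)
                    : body;
        }
        return null;
    }

    /** Decodes quoted-printable text: "=XX" is one byte, "=" at end of line is a soft break. */
    static String decodeQuotedPrintable(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '=') {
                out.write(c); // quoted-printable bodies are plain ASCII
            } else if (text.startsWith("\r\n", i + 1)) {
                i += 2; // soft line break, drop it
            } else if (text.startsWith("\n", i + 1)) {
                i += 1; // soft line break with a bare LF
            } else if (i + 2 < text.length()
                    && Character.digit(text.charAt(i + 1), 16) >= 0
                    && Character.digit(text.charAt(i + 2), 16) >= 0) {
                out.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write('='); // malformed escape, keep it literally
            }
        }
        return new String(out.toByteArray(), charset);
    }
}
```

The returned string is plain HTML, so it can be handed to Jsoup.parse(String) instead of parsing the whole .mht file. Writing the result back as a valid MHT file would additionally require re-encoding the HTML and reassembling all parts with their boundaries, which this sketch does not attempt.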

Some useful references for how to attack that problem include:
