ccDict

Reputation: 653

Jsoup Bug? Parsing an .mht Document

I am trying to parse an MHT document using Jsoup (version 1.7.3). The goal is to open two files and merge them (joining head and body) into one complete file. But first, I have a problem parsing the MHT file: the parsed result is missing a significant amount of information and can't be opened after parsing.

I used the following code:

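import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;

// Class name is illustrative; the original post showed only the members.
public class MhtParseTest {

    private static final String USED_CHARSET = "windows-1252";
    private static final String PATH = "C:\\Test\\";
    private static final Charset CHARSET = Charset.forName(USED_CHARSET);

    @Test
    public void test() throws IOException {
        // Parses the raw .mht file as if it were plain HTML.
        Document doc = Jsoup.parse(new File(PATH, "sourceMht.mht"),
                USED_CHARSET);

        writeDoc(new File(PATH, "parsedMht.mht"), doc.html());
    }

    private void writeDoc(File file, String html) throws IOException {
        // try-with-resources flushes and closes the writer, even on failure.
        try (Writer out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(file), CHARSET))) {
            out.write(html);
        }
    }
}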

Thanks for your help.

Upvotes: 2

Views: 1190

Answers (1)

andyroberts

Reputation: 3518

It's not a Jsoup bug. The problem is that MHT files are MIME multipart files, which bundle HTML and other resources together into a single file. If you open your MHT file in a text editor (e.g. Notepad), you'll see that it's not a pure HTML file but a MIME-encoded file:

MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_01CFB635.40B30630"
....

The various assets (HTML, CSS, images, etc.) each live in their own section of that file. So before you can apply Jsoup to the problem, you first need to parse the MIME multipart file to get at the individual parts.
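To make that concrete, here is a minimal sketch using only the JDK. It splits the file on the boundary, picks the first text/html part, and undoes its transfer encoding. The class `MhtHtmlExtractor` and its methods are hypothetical helpers (not part of Jsoup), and the sketch assumes a flat, well-formed multipart file, read as ISO-8859-1 so that every byte maps to one char, with ISO-8859-1 as the fallback when a part declares no charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the first text/html part of an MHT file.
final class MhtHtmlExtractor {

    private static final Pattern BOUNDARY =
            Pattern.compile("boundary=\"?([^\";\\r\\n]+)\"?", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTML_TYPE =
            Pattern.compile("content-type:\\s*text/html", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);
    private static final Pattern ENCODING =
            Pattern.compile("content-transfer-encoding:\\s*([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // mht: the raw file content, decoded as ISO-8859-1 (assumption, see above).
    static String extractHtml(String mht) {
        Matcher boundary = BOUNDARY.matcher(mht);
        if (!boundary.find()) {
            throw new IllegalArgumentException("No multipart boundary found");
        }
        // Each part is introduced by "--" followed by the declared boundary.
        String delimiter = "--" + boundary.group(1);
        for (String part : mht.split(Pattern.quote(delimiter))) {
            // The first blank line separates a part's headers from its body.
            String[] headBody = part.split("\\r?\\n\\r?\\n", 2);
            if (headBody.length < 2 || !HTML_TYPE.matcher(headBody[0]).find()) {
                continue;
            }
            Matcher cs = CHARSET.matcher(headBody[0]);
            Charset charset = cs.find()
                    ? Charset.forName(cs.group(1))
                    : StandardCharsets.ISO_8859_1; // assumed fallback
            Matcher enc = ENCODING.matcher(headBody[0]);
            String encoding = enc.find() ? enc.group(1).toLowerCase() : "7bit";
            byte[] raw;
            if (encoding.equals("quoted-printable")) {
                raw = decodeQuotedPrintable(headBody[1]);
            } else if (encoding.equals("base64")) {
                raw = Base64.getMimeDecoder().decode(headBody[1]);
            } else {
                raw = headBody[1].getBytes(StandardCharsets.ISO_8859_1);
            }
            return new String(raw, charset);
        }
        throw new IllegalArgumentException("No text/html part found");
    }

    // Decodes "=XX" escapes and drops soft line breaks ("=" at end of line).
    static byte[] decodeQuotedPrintable(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '=') {
                out.write(c);
            } else if (body.startsWith("\r\n", i + 1)) {
                i += 2;
            } else if (body.startsWith("\n", i + 1)) {
                i += 1;
            } else {
                boolean hasTwo = i + 2 < body.length();
                int hi = hasTwo ? Character.digit(body.charAt(i + 1), 16) : -1;
                int lo = hasTwo ? Character.digit(body.charAt(i + 2), 16) : -1;
                if (hi >= 0 && lo >= 0) {
                    out.write(hi * 16 + lo);
                    i += 2;
                } else {
                    out.write('='); // malformed escape: keep it literally
                }
            }
        }
        return out.toByteArray();
    }
}
```

The returned string is plain HTML, so it can be handed to `Jsoup.parse(String)` instead of passing the raw .mht file. For nested multiparts or unusual headers, a full MIME library such as JavaMail or Apache James Mime4j is likely the more robust choice.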

Some useful references for how to attack that problem include:

Upvotes: 3
