SomeDude

Reputation: 14238

Extracting images and text from an mht file

I have an .mht file that contains images and some text. When I open it in Notepad++, I see XML followed by illegible text, which I think is the image data. Can somebody tell me how to extract the images and text from an .mht file using a Java program? Thanks.

Upvotes: 15

Views: 31484

Answers (5)

rumpel

Reputation: 8310

Try python-unmht ¹

¹ I’m the author; I wrote this because none of the other answers worked for me. Luckily, MHTML turns out to be a very simple file format.
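Since the question asks for Java, here is a minimal sketch of why the format is simple: an MHTML file is a MIME multipart message, so it can be split on its boundary and each part decoded by its Content-Transfer-Encoding. This is not how python-unmht works internally; the class name MhtSketch, the part-N output names and the mht_out directory are made up for illustration. It assumes a multipart file whose binary parts are base64-encoded (a file with no boundary is rejected) and uses only the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MhtSketch {

    /** One decoded MIME part: header names are lower-cased, body is the decoded payload. */
    public static final class Part {
        public final Map<String, String> headers;
        public final byte[] body;
        Part(Map<String, String> headers, byte[] body) {
            this.headers = headers;
            this.body = body;
        }
    }

    /** Splits a multipart MHTML document (read as ISO-8859-1 text) into its parts. */
    public static List<Part> parse(String mht) {
        String text = mht.replace("\r\n", "\n");
        int headerEnd = text.indexOf("\n\n");
        if (headerEnd < 0) throw new IllegalArgumentException("no header/body separator found");
        Map<String, String> top = parseHeaders(text.substring(0, headerEnd));
        Matcher m = Pattern.compile("boundary=\"?([^\";]+)\"?", Pattern.CASE_INSENSITIVE)
                .matcher(top.getOrDefault("content-type", ""));
        if (!m.find()) throw new IllegalArgumentException("no multipart boundary found");
        String delimiter = "\n--" + m.group(1).trim();

        List<Part> parts = new ArrayList<>();
        // Keep the newline that ends the header block so the first delimiter matches too.
        String[] chunks = text.substring(headerEnd + 1).split(Pattern.quote(delimiter));
        for (int i = 1; i < chunks.length; i++) {        // chunks[0] is the preamble
            String chunk = chunks[i];
            if (chunk.startsWith("--")) break;           // closing delimiter "--boundary--"
            if (chunk.startsWith("\n")) chunk = chunk.substring(1);
            int sep = chunk.indexOf("\n\n");
            if (sep < 0) continue;                       // part without a body
            Map<String, String> headers = parseHeaders(chunk.substring(0, sep));
            String body = chunk.substring(sep + 2);
            String enc = headers.getOrDefault("content-transfer-encoding", "")
                    .toLowerCase(Locale.ROOT);
            byte[] decoded;
            if (enc.equals("base64")) decoded = Base64.getMimeDecoder().decode(body);
            else if (enc.equals("quoted-printable")) decoded = decodeQuotedPrintable(body);
            else decoded = body.getBytes(StandardCharsets.ISO_8859_1);
            parts.add(new Part(headers, decoded));
        }
        return parts;
    }

    /** Parses "Name: value" lines; indented continuation lines are appended to the previous header. */
    static Map<String, String> parseHeaders(String block) {
        Map<String, String> headers = new LinkedHashMap<>();
        String last = null;
        for (String line : block.split("\n")) {
            if ((line.startsWith(" ") || line.startsWith("\t")) && last != null) {
                headers.merge(last, line.trim(), (a, b) -> a + " " + b);
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    last = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                    headers.put(last, line.substring(colon + 1).trim());
                }
            }
        }
        return headers;
    }

    /** Decodes "=XX" hex escapes and drops "=" soft line breaks. */
    static byte[] decodeQuotedPrintable(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '=' && i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                i += 1;                                  // soft line break
            } else if (c == '=' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                out.write(Character.digit(s.charAt(i + 1), 16) * 16
                        + Character.digit(s.charAt(i + 2), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Path outDir = Files.createDirectories(Paths.get(args.length > 1 ? args[1] : "mht_out"));
        // ISO-8859-1 maps every byte to one char, so nothing is lost before decoding.
        String raw = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1);
        List<Part> parts = parse(raw);
        for (int i = 0; i < parts.size(); i++) {
            Files.write(outDir.resolve("part-" + i), parts.get(i).body);
            System.out.println("part-" + i + "  " + parts.get(i).headers.get("content-type"));
        }
    }
}
```

Run it as "java MhtSketch page.mht outdir". The text is usually in the first text/html part and the images are the image/* parts. Line endings are normalised before splitting, so a part sent with a raw binary transfer encoding would be damaged; a real MIME library is the safer choice for anything beyond a quick extraction.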

Upvotes: 3

XP1

Reputation: 7193

mht2htm worked for me. It retains the filenames of the extracted files.

mht2htm converts MS Internet Explorer .mht files into common .html files you can open on any system with any browser.

mht2htm extracts all files from the .mht file into a single directory. It then tries to find references to the extracted files and replaces them with relative addresses. Addresses of remote files are left unchanged, so you can still fetch them from the internet (if you wish).

Official website:

Download:

Upvotes: 1

conracer

Reputation: 1

None of this seems to work anymore, at least for MHTML files saved by Chrome, and none of it is in Java. Nevertheless, a quick and dirty way on Windows is to type the following into WSL:

sudo apt install mpack

then

munpack filename.mhtml

Then append a .jpg extension to all generated files so that Windows Explorer shows a preview and you can see which files are images.

PS: Since munpack prints the MIME type of each file to stdout, one line per file, one could write a script that appends the correct file extensions after munpack has finished.
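A rough Java sketch of such a script follows. It assumes (this is an assumption, check it against your munpack version) that each output line has the form "name (type/subtype)"; the class name MunpackRename and the small extension table are invented for the example. Lines it cannot parse and MIME types it does not know are skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MunpackRename {
    // ASSUMPTION: each munpack output line looks like "<file name> (<type>/<subtype>)".
    private static final Pattern LINE =
            Pattern.compile("(.+?)\\s+\\(([\\w.+-]+/[\\w.+-]+)\\)\\s*");

    // Illustrative table only; extend it for the types found in your files.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("image/jpeg", ".jpg");
        EXTENSIONS.put("image/png", ".png");
        EXTENSIONS.put("image/gif", ".gif");
        EXTENSIONS.put("text/html", ".html");
        EXTENSIONS.put("text/css", ".css");
    }

    /** Returns {oldName, newName} for one output line, or null if nothing should be renamed. */
    static String[] renamePair(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) return null;
        String name = m.group(1);
        String ext = EXTENSIONS.get(m.group(2).toLowerCase(Locale.ROOT));
        if (ext == null || name.toLowerCase(Locale.ROOT).endsWith(ext)) return null;
        return new String[] {name, name + ext};
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] pair = renamePair(line);
            if (pair == null) continue;
            Path source = dir.resolve(pair[0]);
            // Only rename files that munpack actually wrote into this directory.
            if (Files.exists(source)) Files.move(source, dir.resolve(pair[1]));
        }
    }
}
```

Usage, under the same assumption about the output format: munpack filename.mhtml | java MunpackRename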

Upvotes: 0

Calimero100582

Reputation: 852

It's a bit old, but opening the file in Internet Explorer and saving it as HTML also does the job.

Update:

If you open the .mht file in IE and save it with "Save as type" set to "Webpage, complete (.htm;.html)", IE creates a 'filename.htm' file as well as a 'filename_files' directory. That directory will contain a lot of .tmp files. For output from the MS "Problem Steps Recorder", these include a number of files with '(1)' in the name (for example, there might be a 'mhtD3B8.tmp' file as well as a 'mhtD3B8(1).tmp' file). The '(1)' files are the images, in .jpg format, simply with a .tmp extension. Search that folder for all files with '(1)' in the name and copy them to a different directory.

Once in the new directory, open a cmd window pointed there. To change all the extensions at once, type "rename *.tmp *.jpg" (without the quotes) and press Enter. Voila - all the image files are extracted.
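If you would rather do the renaming from Java, here is a small sketch. Instead of relying on '(1)' in the name, it checks whether each .tmp file starts with the JPEG signature (bytes FF D8 FF) and renames only those. The class name TmpToJpg is made up; point it at the 'filename_files' directory or at your copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TmpToJpg {

    /** True if the first bytes match the JPEG start-of-image signature FF D8 FF. */
    static boolean looksLikeJpeg(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xD8
                && (head[2] & 0xFF) == 0xFF;
    }

    /** Replaces a trailing ".tmp" (any case) with ".jpg". */
    static String jpgName(String tmpName) {
        return tmpName.replaceAll("(?i)\\.tmp$", ".jpg");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");

        // Collect first, rename afterwards, so the directory is not modified while listing it.
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.tmp")) {
            for (Path f : files) candidates.add(f);
        }

        for (Path f : candidates) {
            byte[] head = new byte[3];
            int read;
            try (InputStream in = Files.newInputStream(f)) {
                read = in.read(head);
            }
            if (read == 3 && looksLikeJpeg(head)) {
                Path target = f.resolveSibling(jpgName(f.getFileName().toString()));
                Files.move(f, target);
                System.out.println(f.getFileName() + " -> " + target.getFileName());
            }
        }
    }
}
```

Files that are not JPEGs keep their .tmp extension, so nothing gets mislabelled.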

As for accessing the text - since the file is now saved as a .htm file, you should be able to open that file in Notepad++ and parse/read it properly there.

Hope this helps!

Upvotes: 5

zb226

Reputation: 10529

There's an open-source Perl tool called unmht which should do the job:

The first HTML file in the archive is taken to be the primary web page, the other contained files for "page requisites" such as images or frames. The primary web page is written to the output directory (the current directory by default), the requisites to a subdirectory named after the primary HTML file name without extension, with "_files" appended. Link URLs in all HTML files referring to requisites are rewritten to point to the saved files.
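To illustrate the layout described above in Java (the language the question asks for), here is a hedged sketch of the two steps: naming the "_files" subdirectory after the primary page, and rewriting links to point at the saved files. It is not unmht's code; the class and method names are invented, and the link rewriting is a plain string replacement rather than real HTML parsing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RequisiteLayout {

    /** "page.html" -> "page_files": the primary file name without extension, plus "_files". */
    static String requisiteDir(String primaryHtmlName) {
        int dot = primaryHtmlName.lastIndexOf('.');
        String base = dot > 0 ? primaryHtmlName.substring(0, dot) : primaryHtmlName;
        return base + "_files";
    }

    /**
     * Replaces every original URL in the HTML with the relative path of its saved copy.
     * Longer URLs are replaced first so a URL that is a prefix of another one
     * does not clobber the longer match.
     */
    static String rewriteLinks(String html, Map<String, String> urlToLocalPath) {
        List<String> urls = new ArrayList<>(urlToLocalPath.keySet());
        urls.sort(Comparator.comparingInt(String::length).reversed());
        String out = html;
        for (String url : urls) {
            out = out.replace(url, urlToLocalPath.get(url));
        }
        return out;
    }
}
```

The map would typically be built from each part's Content-Location header (the original URL) and the path the part was written to.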

Upvotes: 4
