Reputation: 9646
I'm trying to traverse a Word document and save all the images found in it. I uploaded a sample document to the online demo and noticed that the images are listed as:
/word/media/image1.png rId5 image/png
/word/media/image2.png rId5 image/png
/word/media/image3.jpg rId5 image/jpeg
How can I programmatically save these images while traversing the document?
Currently I get all the text from the document like this:
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(filePath));
MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
Document wmlDocumentEl = (org.docx4j.wml.Document) documentPart.getJaxbElement();
Body body = wmlDocumentEl.getBody();
DocumentTraverser traverser = new DocumentTraverser();
class DocumentTraverser extends TraversalUtil.CallbackImpl {
    @Override
    public List<Object> apply(Object o) {
        // Called for each object visited; text runs are handled here.
        if (o instanceof org.docx4j.wml.Text) {
            ....
        }
        return null;
    }
}
Upvotes: 4
Views: 2854
Reputation: 45
To access the embedded images in a .docx file, use the following steps:
◾ If it's not already a .docx file, open the file in Word 2007 and save it as a Word Document (*.docx).
◾ Change the file extension from .docx to .zip, then open the archive; the embedded images are in the word/media folder.
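Renaming works because a .docx file is an ordinary ZIP archive, so the same thing can be done in code without docx4j. The following is a minimal sketch using only the JDK; the class name `DocxMediaUnzipper` and the method `extractMedia` are hypothetical, and it assumes the images live under `word/media/`, as in the listing in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: copies every entry under word/media/ out of a .docx. */
class DocxMediaUnzipper {

    /** Returns the paths of the files written to outputDir. */
    static List<Path> extractMedia(Path docx, Path outputDir) throws IOException {
        List<Path> saved = new ArrayList<>();
        Files.createDirectories(outputDir);
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(docx))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("word/media/")) {
                    continue; // not an embedded media file
                }
                // Keep only the last path segment, e.g. image1.png.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                Path target = outputDir.resolve(fileName);
                // Files.copy reads up to the end of the current ZIP entry.
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                saved.add(target);
            }
        }
        return saved;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxMediaUnzipper input.docx outputDir
        for (Path p : extractMedia(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println("Saved " + p);
        }
    }
}
```

This saves the image files but tells you nothing about where in the document each one is used.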
Upvotes: 0
Reputation: 15878
For embedded (as opposed to external) images, the simplest approach is:
import java.io.FileOutputStream;
import java.util.Map.Entry;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.Part;
import org.docx4j.openpackaging.parts.PartName;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPart;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage;
public class SaveImages {
    public static void main(String[] args) throws Exception {
        // Path of the .docx to read; taken from the command line here.
        String inputfilepath = args[0];
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
        for (Entry<PartName, Part> entry : wordMLPackage.getParts().getParts().entrySet()) {
            if (entry.getValue() instanceof BinaryPartAbstractImage) {
                // Part names look like /word/media/image1.png; use the last segment as the output file name.
                String name = entry.getKey().getName();
                String yourfile = name.substring(name.lastIndexOf('/') + 1);
                try (FileOutputStream fos = new FileOutputStream(yourfile)) {
                    ((BinaryPart) entry.getValue()).writeDataToOutputStream(fos);
                }
            }
        }
    }
}
If you care about the context of the images, you have to search for them in the relevant parts (e.g. MainDocumentPart, plus your header/footer parts as required).
https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ImageConvertEmbeddedToLinked.java will give you a hint as to how to do that. Note that there are two different XML structures for images: the newer DrawingML and the legacy VML.
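To make the two structures concrete, here is a rough sketch that uses only the JDK rather than docx4j. It walks `word/document.xml` in document order and resolves each image reference through `word/_rels/document.xml.rels`. The class name `ImageReferenceLister` is hypothetical. The sketch assumes a transitional-namespace document whose main part is `word/document.xml`, handles embedded images only (`r:embed` on DrawingML `a:blip`, `r:id` on VML `v:imagedata`), and ignores headers and footers.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: lists image targets in the order they appear in the body. */
class ImageReferenceLister {
    static final String PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String DOC_REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    static final String DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String VML_NS = "urn:schemas-microsoft-com:vml";

    static List<String> imageTargetsInDocumentOrder(String docxPath) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        try (ZipFile zip = new ZipFile(docxPath)) {
            // Map relationship ids (e.g. rId5) to targets (e.g. media/image1.png).
            Map<String, String> targetsById = new HashMap<>();
            NodeList rels = parse(zip, builder, "word/_rels/document.xml.rels")
                    .getElementsByTagNameNS(PKG_REL_NS, "Relationship");
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                targetsById.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
            }
            // Visit every element in document order and pick out image references.
            List<String> found = new ArrayList<>();
            NodeList all = parse(zip, builder, "word/document.xml").getElementsByTagNameNS("*", "*");
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                String relId = "";
                if (DRAWINGML_NS.equals(e.getNamespaceURI()) && "blip".equals(e.getLocalName())) {
                    relId = e.getAttributeNS(DOC_REL_NS, "embed"); // DrawingML
                } else if (VML_NS.equals(e.getNamespaceURI()) && "imagedata".equals(e.getLocalName())) {
                    relId = e.getAttributeNS(DOC_REL_NS, "id"); // legacy VML
                }
                if (targetsById.containsKey(relId)) {
                    found.add(targetsById.get(relId));
                }
            }
            return found;
        }
    }

    private static Document parse(ZipFile zip, DocumentBuilder builder, String entryName) throws Exception {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) {
            throw new IllegalArgumentException("Missing part: " + entryName);
        }
        try (InputStream in = zip.getInputStream(entry)) {
            return builder.parse(in);
        }
    }
}
```

The returned targets are relative to the `word/` folder, so `media/image1.png` corresponds to the part `/word/media/image1.png`.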
Upvotes: 3