Reputation: 41

Extract images from word document using Python

How can i extract images/logo from word document using python and store them in a folder. Following code converts docx to html but it doesn't extract images from the html. Any pointer/suggestion will be of great help.

    profile_path = <file path>
    result=mammoth.convert_to_html( profile_path)
    f = open(profile_path, 'rb')
    b = open(profile_html, 'wb')
    document = mammoth.convert_to_html(f)
    b.write(document.value.encode('utf8'))
    f.close()
    b.close()

Upvotes: 4

Answers (4)

FilipeTavares

Reputation: 63

Look the Alderven's answer at Extract all the images in a docx file using python

The zipfile works for more image formats than the docx2txt. For example, EMF images are not extracted by docx2txt but can be extracted by zipfile.

Upvotes: 1

K J

Reputation: 11877

Native without any lib

To extract the source Images from the docx (which is a variation on a zip file) without distortion or conversion.

shell out to OS and run

tar -m -xf DocxWithImages.docx word/media

You will find the source images Jpeg, PNG WMF or others in the word media folder extracted into a folder of that name. These are the unadulterated source embedment's without scale or crop.

You may be surprised that the visible area may be larger then any cropped version used in the docx itself, and thus need to be aware that Word does not always crop images as expected (A source of embarrassing redaction failure)

Upvotes: 3

dataninsight

Reputation: 1343

Extract all the images in a docx file using python

1. Using docxtxt

import docx2txt
#extract text 
text = docx2txt.process(r"filepath_of_docx")
#extract text and write images in Temporary Image directory
text = docx2txt.process(r"filepath_of_docx",r"Temporary_Image_Directory")

2. Using aspose

import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes :
    shape = shape.as_shape()
    if (shape.has_image) :
        # set image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save image
        shape.image_data.save(imageFileName)
        imageIndex += 1

Upvotes: 1

gabriel capparelli

Reputation: 101

You can use the docx2txt library, it will read your .docx document and export images to a directory you specify (must exist).

!pip install docx2txt
import docx2txt
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')

After execution you will have the images in /home/example/img/ and the variable text will have the document text. They would be named image1.png ... imageN.png in order of appearance.

Note: Word document must be in .docx format.

Upvotes: 2