Ofer Sadan

Reputation: 11952

Remove all images from docx files

I've searched the documentation for python-docx and other packages, as well as Stack Overflow, but could not find out how to remove all images from docx files with Python.

My exact use case: I need to convert hundreds of Word documents to "draft" format to be viewed by clients. Those drafts should be identical to the original documents, but all the images must be deleted or redacted from them.

Sorry for not including an example of what I tried: it amounts to hours of research that turned up nothing useful. I found this question on how to extract images from Word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python

From there and other sources I've found out that docx files can be read as simple zip files. I don't know whether it's possible to "re-zip" them without the images without affecting the integrity of the docx file, but I thought this might be a path to a solution. (Edit: simply deleting the images works, but prevents python-docx from continuing to work with the file because of the missing references to images.)
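The references mentioned in the edit above can be inspected with the standard library alone. The following is a minimal sketch, assuming the usual package layout in which the main document's relationships are stored in word/_rels/document.xml.rels. The function name image_relationships is made up for illustration and is not part of python-docx.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts and the relationship type used for pictures
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = ("http://schemas.openxmlformats.org/officeDocument/2006/"
              "relationships/image")


def image_relationships(docx, rels_part="word/_rels/document.xml.rels"):
    """Return {relationship id: target} for each image the main document references.

    `docx` may be a path or a binary file-like object. The default value of
    `rels_part` is an assumption about the package layout.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read(rels_part))
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter("{%s}Relationship" % RELS_NS)
        if rel.get("Type") == IMAGE_TYPE
    }
```

Each entry returned here still points at a file under word/media/ after that file has been deleted from the archive, which is one plausible reason a library that resolves these targets would stop working with the file.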

Any ideas?

Upvotes: 3

Views: 4701

Answers (3)

mata

Reputation: 69082

If your goal is to redact images, maybe this code I used for a similar use case could be useful:

import io
import zipfile
from PIL import Image, ImageFilter

blur = ImageFilter.GaussianBlur(40)

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                content = inzip.read(info)
                if name.lower().endswith((".png", ".jpg", ".jpeg", ".gif")):
                    ext = name.rsplit(".", 1)[-1].lower()
                    # Pillow expects the format name "jpeg", not "jpg"
                    fmt = "jpeg" if ext == "jpg" else ext
                    img = Image.open(io.BytesIO(content))
                    # convert() turns palette images into RGB(A) so they can be filtered
                    img = img.convert().filter(blur)
                    outb = io.BytesIO()
                    img.save(outb, fmt)
                    content = outb.getvalue()
                # writestr() records the size and CRC of the new content itself
                outzip.writestr(info, content)

Here I used PIL to blur the images in some files, but any other suitable operation could be used instead of the blur filter. This worked quite nicely for my use case.
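As one example of "any other suitable operation", the sketch below uses the same copy-the-archive pattern but swaps each PNG for a blank 1x1 placeholder built with the standard library, so no image content survives at all. The names blank_png and blank_out_pngs are hypothetical, only .png entries are handled, and the assumption that Word stretches the placeholder to the original frame size has not been verified here.

```python
import struct
import zipfile
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _png_chunk(tag, data):
    """Assemble one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def blank_png():
    """Return the bytes of a 1x1 white, 8-bit grayscale PNG."""
    # Width, height, bit depth, color type, compression, filter, interlace
    header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixel_row = b"\x00\xff"  # filter type 0, then a single white pixel
    return (PNG_SIGNATURE
            + _png_chunk(b"IHDR", header)
            + _png_chunk(b"IDAT", zlib.compress(pixel_row))
            + _png_chunk(b"IEND", b""))


def blank_out_pngs(src, dst):
    """Copy the archive `src` to `dst`, swapping each .png entry for a blank one.

    Both arguments may be paths or binary file-like objects. Returns the names
    of the entries that were replaced.
    """
    placeholder = blank_png()
    replaced = []
    with zipfile.ZipFile(src) as inzip, zipfile.ZipFile(dst, "w") as outzip:
        for info in inzip.infolist():
            content = inzip.read(info)
            if info.filename.lower().endswith(".png"):
                content = placeholder
                replaced.append(info.filename)
            # Reusing the original ZipInfo keeps the entry name and compression
            # type, so existing references to the image stay valid.
            outzip.writestr(info, content)
    return replaced
```

Because every entry keeps its original name, the relationships inside the document still resolve, which is the property that plain deletion of the image files loses.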

Upvotes: 7

Alan

Reputation: 3042

I don't think it's currently implemented in python-docx.

Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.

The Word Object Model for inline shapes supports a Delete() method, which should in principle be accessible. However, it is not listed in the python-docx examples for InlineShapes. Paragraphs have a similar Delete() method in the Word Object Model, and for those there is an open feature request to add this functionality, which dates back to 2014! If it hasn't been added for paragraphs, it won't be available for InlineShapes, as they are implemented as discrete paragraphs.

You could do this with win32com if you have a machine with Word and Python installed. This would allow you to call the Word Object Model directly, giving you access to the Delete() method. In fact, you could probably cheat: rather than scrolling through the document to get each image, you can call Find and Replace to clear the images. This SO question talks about win32com find and replace:

import os
import win32com.client

# All Word files in the current directory
docs = [f for f in os.listdir('.') if f.endswith(('doc', 'docx'))]

# Text to find -> replacement text; add as many pairs as you want
FromTo = {"First Name": "John",
          "Last Name": "Smith"}

word = win32com.client.DispatchEx("Word.Application")
word.Visible = True  # Comment out after testing
word.DisplayAlerts = False
for doc in docs:
    word.Documents.Open(os.path.join(os.getcwd(), doc))
    for From, To in FromTo.items():
        word.Selection.Find.Text = From
        word.Selection.Find.Replacement.Text = To
        # Replace must be 2 (wdReplaceAll) to replace every occurrence
        word.Selection.Find.Execute(Replace=2, Forward=True)
    name, ext = doc.rsplit('.', 1)
    word.ActiveDocument.SaveAs(os.path.join(os.getcwd(), '{}_2.{}'.format(name, ext)))
    word.ActiveDocument.Close()  # Avoid piling up open documents

word.Quit()  # Releases the Word object from memory

In this case, since we want to remove images, we would use the shortcode ^g as the Find.Text and an empty string as the replacement:

find = word.Selection.Find
find.Text = "^g"
find.Replacement.Text = ""
find.Execute(Replace=2, Forward=True)  # 2 = wdReplaceAll, so every match is cleared

Upvotes: 1

JustLudo

Reputation: 1790

I don't know this library, but looking through the documentation I found this section about images. It mentions that it is currently not possible to insert images other than inline ones. If that is what you currently have in your documents, I assume you can also retrieve them by looking in the Document object and then remove them?

The Document is explained here.

Although not a duplicate, you might also want to look at this question's answer where user "scanny" explains how he finds images using the library.

Upvotes: 0
