Preston Donovan

Reputation: 61

Using Python to extract images and text from a Word document

I would like to run a script on a folder full of Word documents that reads through each document and pulls out the images and their captions (the text right below each image). From the research I've done, I think pywin32 might be a viable solution. I know how to use pywin32 to find strings and pull them out, but I need help with the images part. How can I read through a .docx file and have an event occur when an image is found? Thank you for any help! I am using Python 2.7.

Upvotes: 6

Views: 9839

Answers (4)

Sathish Kumar MK

Reputation: 1

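import docx  # provided by the python-docx package

document = docx.Document(filepath)  # filepath: path to a .docx file
# inline_shapes lists the pictures placed inline with the text
for image in document.inline_shapes:
    print(image.width, image.height)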

Try this; it prints the width and height of each inline image.

Upvotes: -2

Ankush Shah

Reputation: 958

You can use the Python module docx2txt to extract both the text and the images from .docx files.
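As a rough sketch of how that could look for a whole folder: docx2txt.process takes the path of a .docx file and, optionally, a directory to write the images into, and returns the document text. The function name extract_folder, the one-subfolder-per-document layout, and the folder names in the usage line are illustrative assumptions, not part of docx2txt.

```python
import os


def extract_folder(folder, image_root):
    """Return {filename: text} for each .docx in folder, saving images too.

    extract_folder is a hypothetical helper; only docx2txt.process comes
    from the library.
    """
    # Third-party dependency (pip install docx2txt), imported here so a
    # missing install only fails when extraction is actually attempted.
    import docx2txt

    results = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".docx"):
            continue
        # One image subfolder per document, so files such as image1.png
        # from different documents do not overwrite each other.
        image_dir = os.path.join(image_root, os.path.splitext(name)[0])
        if not os.path.isdir(image_dir):
            os.makedirs(image_dir)
        results[name] = docx2txt.process(os.path.join(folder, name), image_dir)
    return results
```

Calling extract_folder("reports", "report_images") would then give you the text of every document keyed by file name, with the pictures written under report_images. Note that the text comes back as one string per document, so matching each caption to its image would still need extra work.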

Upvotes: 2

Kevin C.

Reputation: 2527

A .docx file is a ZIP archive, so you can unzip it and pull the images out of its word/media folder.
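A minimal sketch of the unzip approach, using only the standard library. It relies on embedded pictures being stored under word/media/ inside the archive; the function name extract_images and the paths in the usage line are placeholders chosen for illustration. It avoids Python 3-only syntax, so it should also run on 2.7, though that is untested here.

```python
import os
import zipfile


def extract_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx archive into out_dir.

    Returns the list of paths that were written.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything that is not a file inside word/media/.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(out_dir, os.path.basename(name))
            with open(target, "wb") as handle:
                handle.write(archive.read(name))
            saved.append(target)
    return saved
```

For example, extract_images("report.docx", "report_images") writes the pictures into report_images and returns their paths. This gets you the image files only; the captions live in word/document.xml and have to be read separately.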

Upvotes: 4

Fredrik Pihl

Reputation: 45670

You may find some inspiration in this post: "How can I search a word in a Word 2007 .docx file?"

Upvotes: 3
