Reputation:
How can I find the images present in a Word document? Is there a Python module for this? I searched but found nothing useful. This is how I read text from a Word file, but the code below gives no information about the images in the file:
from docx import Document

document = Document('new-file-name.docx')
para = document.paragraphs
for par in para:
    print(par.text)
Upvotes: 5
Views: 10025
Reputation: 131
You first have to extract the image files from the .docx archive, then look for image elements in the document XML and match each one to its rId.
import os
import docx
import docx2txt

img_folder = 'img_folder/'

# Extract the images to img_folder/
docx2txt.process('document.docx', img_folder)

# Open your .docx document
doc = docx.Document('document.docx')

# Save all 'rId: filename' relationships in a dictionary named rels
rels = {}
for r in doc.part.rels.values():
    if isinstance(r._target, docx.parts.image.ImagePart):
        rels[r.rId] = os.path.basename(r._target.partname)

# Then process your text
for paragraph in doc.paragraphs:
    # If you find an image
    if 'Graphic' in paragraph._p.xml:
        # Get the rId of the image
        for rId in rels:
            if rId in paragraph._p.xml:
                # Your image will be in os.path.join(img_folder, rels[rId])
                print(os.path.join(img_folder, rels[rId]))
    else:
        # It's not an image
        print(paragraph.text)
GitHub Repository Link: django-docx-import
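For comparison, the same rId matching can be sketched with only the standard library, by reading the relationships part and the document XML straight from the archive. This is a rough sketch, not the method used above: the function name `images_by_paragraph` is made up for illustration, it assumes the usual (transitional) Office Open XML namespaces, and it only looks at DrawingML `a:blip` elements, so older VML pictures would be missed. The paragraph index counts every `w:p` element in document order, including those inside tables, which differs from `doc.paragraphs` in python-docx.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespaces assumed here are the transitional Office Open XML ones
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_REL_TYPE = R_NS + "/image"


def images_by_paragraph(docx_path):
    """Return a list of (paragraph_index, image_part_name) tuples."""
    with zipfile.ZipFile(docx_path) as z:
        rels_root = ET.fromstring(z.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(z.read("word/document.xml"))

    # Map each embedded image relationship id to its name inside the archive
    rels = {}
    for rel in rels_root.iter("{%s}Relationship" % PKG_REL_NS):
        if rel.get("Type") != IMAGE_REL_TYPE:
            continue
        if rel.get("TargetMode") == "External":
            continue  # linked, not embedded: nothing to find in the archive
        target = posixpath.normpath(posixpath.join("word", rel.get("Target")))
        rels[rel.get("Id")] = target.lstrip("/")

    found = []
    # Walk every w:p element in document order and collect its pictures
    for index, para in enumerate(doc_root.iter("{%s}p" % W_NS)):
        for blip in para.iter("{%s}blip" % A_NS):
            r_id = blip.get("{%s}embed" % R_NS)
            if r_id in rels:
                found.append((index, rels[r_id]))
    return found
```

Because the rId is read from the `r:embed` attribute and looked up exactly, `rId1` cannot be confused with `rId10`, which a plain substring test on the paragraph XML would allow.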
Upvotes: 13
Reputation: 9937
Since .docx files are zip files, you can use the zipfile module:
import zipfile

z = zipfile.ZipFile("1.docx")

# print list of valid attributes for ZipFile object
print(dir(z))

# print all files in zip archive
all_files = z.namelist()
print(all_files)

# get all files in word/media/ directory
images = [name for name in all_files if name.startswith('word/media/')]
print(images)

# open an image and save it
image1 = z.open('word/media/image1.jpeg').read()
with open('image1.jpeg', 'wb') as f:
    f.write(image1)

# Extract file
z.extract('word/media/image1.jpeg', r'path_to_dir')
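If you want every image rather than one hard-coded name, the steps above could be wrapped up roughly like this. It is a minimal sketch for Python 3; the function name `extract_docx_images` and the `out_dir` parameter are made up for illustration, and it assumes all embedded images live under word/media/ as in the listing above.

```python
import os
import zipfile


def extract_docx_images(docx_path, out_dir):
    """Copy every file under word/media/ into out_dir and return the new paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as z:
        for name in z.namelist():
            # Skip everything outside word/media/ and any directory entries
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Keep only the base name so the files land flat in out_dir
            target = os.path.join(out_dir, os.path.basename(name))
            with z.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
            saved.append(target)
    return saved
```

Calling `extract_docx_images("1.docx", "images")` would return the list of written file paths, so you can see at a glance whether the document contained any images at all.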
Upvotes: 15