Reputation: 31
This is my very first question here.
I have a lot of MS Word files with one or more PDFs inserted as objects. I need to process all the Word files and extract the PDFs, saving them as PDF files while leaving each MS Word file exactly as I found it. So far I have this code to test it on one file:
import win32com.client as win32

# Start Word in the background and open the test document
word = win32.Dispatch('Word.Application')
word.Application.Visible = False
doc1 = word.Documents.Open('C:\\word_merge\\docx_con_pdfs.docx')
for s in doc1.InlineShapes:
    # Only handle embedded Adobe Acrobat documents
    if s.OLEFormat.ClassType == 'AcroExch.Document.DC':
        # Run the object's default verb, which opens it in Adobe Reader
        s.OLEFormat.DoVerb()
_ = input("Hit Enter to Quit")
doc1.Close()
word.Application.Quit()
I know this works because s.OLEFormat.DoVerb() effectively opens the files in Adobe Reader and keeps them open until the "Hit Enter" moment, when they are closed along with the Word file.
It is at this point that I need to replace DoVerb() with some code that saves the OLE object to a PDF file.
At this point s contains the file I need, but I can't find a way to save it as a file instead of only opening it.
Please help me. I have spent many hours reading articles and haven't found the answer.
Upvotes: 1
Views: 2742
Reputation: 31
I found a workaround on the python-win32 mailing list, thanks to Chris Else. As someone said in a comment, the .bin file can't simply be transformed into a PDF; the PDF has to be read out of the OLE file stored inside the .docx. The code that Chris sent me was:
import olefile
from zipfile import ZipFile
from glob import glob

# How many PDF documents have we saved
pdf_count = 0

# Loop through all the .docx files in the current folder
for filename in glob("*.docx"):
    try:
        # Try to open the document as ZIP file
        with ZipFile(filename, "r") as archive:
            # Find files in the word/embeddings folder of the ZIP file
            for entry in archive.infolist():
                if not entry.filename.startswith("word/embeddings/"):
                    continue
                # Try to open the embedded OLE file
                with archive.open(entry.filename) as f:
                    if not olefile.isOleFile(f):
                        continue
                    ole = olefile.OleFileIO(f)
                    # CLSID for Adobe Acrobat Document
                    if ole.root.clsid != "B801CA65-A1FC-11D0-85AD-444553540000":
                        continue
                    if not ole.exists("CONTENTS"):
                        continue
                    # Extract the PDF from the OLE file
                    pdf_data = ole.openstream('CONTENTS').read()
                    # Does the embedded file have a %PDF- header?
                    if pdf_data[0:5] == b'%PDF-':
                        pdf_count += 1
                        pdf_filename = "Document %d.pdf" % pdf_count
                        # Save the PDF
                        with open(pdf_filename, "wb") as output_file:
                            output_file.write(pdf_data)
    except Exception as exc:
        # Report the failure and carry on with the next document
        print("Unable to open '%s': %s" % (filename, exc))

print("Extracted %d PDF documents" % pdf_count)
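If the script reports 0 extracted PDFs, it may help to check what is actually stored under word/embeddings/ in a given .docx before looking at CLSIDs. The following is only a minimal sketch using the standard library, not part of Chris's code: list_embedded_objects and OLE_MAGIC are names made up for this illustration, and the check only compares the first 8 bytes against the OLE2 compound file signature, so it says nothing about whether the object is a PDF.

```python
import zipfile
from glob import glob

# Signature found at the start of every OLE2 compound file
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def list_embedded_objects(docx_path):
    """Return (entry name, looks_like_ole) pairs for word/embeddings/ entries.

    Hypothetical helper: docx_path may be a path or a file-like object,
    anything zipfile.ZipFile accepts.
    """
    results = []
    with zipfile.ZipFile(docx_path) as archive:
        for entry in archive.infolist():
            if not entry.filename.startswith("word/embeddings/"):
                continue
            if entry.is_dir():
                continue
            # Read only the header; the document itself is never modified
            with archive.open(entry) as f:
                header = f.read(len(OLE_MAGIC))
            results.append((entry.filename, header == OLE_MAGIC))
    return results


if __name__ == "__main__":
    # Print the embedded entries of every .docx in the current folder
    for filename in glob("*.docx"):
        for name, is_ole in list_embedded_objects(filename):
            print("%s: %s (OLE: %s)" % (filename, name, is_ole))
```

Entries that show up as OLE files but are skipped by the main script probably have a different CLSID or no CONTENTS stream, which would be the next thing to inspect with olefile.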
Upvotes: 2