Reputation: 990
I ran into this problem on a Mac too, and wanted to share my solution: a bash script that needs no additional applications.
Upvotes: 2
Views: 3242
This script extracts all the PDF files embedded in a Word document.
Put the script file in the same folder as your word.docx file, make it executable (chmod +x extract_docx_objects.sh), and run it like this:
./extract_docx_objects.sh word.docx
The extracted files end up in the subfolder docx_zip/word/embeddings/.
Here's the code:
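#!/bin/bash
docx="$1"
echo "$docx"
# Start from a clean working folder, then unpack the .docx (it is a zip archive).
rm -rf docx_zip
mkdir -p docx_zip
cp "$docx" docx_zip/temp.zip
cd docx_zip/ || exit 1
unzip temp.zip
cd word/embeddings/ || exit 1
ls -la *.bin
for f in *.bin
do
    echo "processing $f..."
    fname="${f%.*}"
    # xxd -b prints 6 bytes per line, so (line - 1) * 6 is the byte offset
    # of the line that holds the marker (the cut is accurate to within a line).
    start=$(xxd -b "$f" | grep -n '%PDF' | head -n 1 | awk -F: '{print $1}')
    # A PDF can contain several %%EOF markers; use the last one.
    end=$(xxd -b "$f" | grep -n '%%EOF' | tail -n 1 | awk -F: '{print $1}')
    if [ -z "$start" ] || [ -z "$end" ]; then
        echo "no PDF found in $f, skipping"
        continue
    fi
    start1=$(((start-1)*6))
    end1=$(((end-1)*6+5*2))
    # count is a length in bytes (bs=1), not an end offset.
    dd skip="$start1" count="$((end1-start1))" if="$f" of="$fname.pdf" bs=1
done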
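The script above finds the markers by xxd line number, so the cut can start a byte or two before %PDF and end a few bytes after %%EOF. If you want a byte-exact cut, here is a minimal sketch of an alternative. It assumes a grep that reports the byte offset of each match with -a -b -o (GNU grep does; I have not verified this on every macOS version), and extract_pdf is just a name I made up for illustration.

```shell
#!/bin/bash
# extract_pdf BIN OUT  (hypothetical helper, not part of the original script)
# Copies the bytes from the first "%PDF" up to and including the last "%%EOF".
# Assumption: grep -abo prints "<byte offset>:<match>" for each match (GNU grep).
extract_pdf() {
    local bin="$1" out="$2" start end
    start=$(LC_ALL=C grep -abo '%PDF' "$bin" | head -n 1 | cut -d: -f1)
    end=$(LC_ALL=C grep -abo '%%EOF' "$bin" | tail -n 1 | cut -d: -f1)
    if [ -z "$start" ] || [ -z "$end" ] || [ "$end" -lt "$start" ]; then
        echo "no PDF markers found in $bin" >&2
        return 1
    fi
    # tail -c +N counts from 1, and "%%EOF" is 5 bytes long.
    tail -c +"$((start + 1))" "$bin" | head -c "$((end + 5 - start))" > "$out"
}
```

Inside the loop you would call it as extract_pdf "$f" "$fname.pdf" instead of the xxd and dd lines.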
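Note that the script above deletes any existing docx_zip folder without asking (rm -rf docx_zip). I didn't add a check for that here, but you may want to test whether the folder already exists before deleting it.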
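A minimal sketch of such a check, assuming you would rather stop than overwrite. The function name prepare_workdir is mine, not part of the original script:

```shell
#!/bin/bash
# prepare_workdir DIR  (hypothetical helper)
# Creates DIR, but refuses to touch it if something with that name already exists.
prepare_workdir() {
    local dir="$1"
    if [ -e "$dir" ]; then
        echo "$dir already exists; remove it yourself or pick another folder" >&2
        return 1
    fi
    mkdir -p "$dir"
}
```

In the script you could then replace the rm -rf and mkdir lines with prepare_workdir docx_zip || exit 1.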
Enjoy!
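If you need a VBA macro on Windows to do the same, here's my solution.
It is only a partial solution, and it needs some preparation before you can run it: as the folder-picker prompt in the macro shows, the .docx has to be unzipped into a folder first (a .docx file is a zip archive).
The VBA macro: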
Sub export_PDFs()
    ' Extracts the PDF embedded in each .bin file under <unzipped docx>\word\embeddings.
    Dim Contents As String
    Dim PDF As String
    Dim hFile As Integer
    Dim i As Long, j As Long
    Dim ExtractedZippedDocxFolder As String, FileNameIndex As String
    Dim FileNameBin As String, FileNamePDF As String, BinFolderPath As String
    Dim fileIndex As Integer
    Dim objFSO As Object, objFolder As Object, objFile As Object
    Dim dlgOpen As FileDialog

    Set dlgOpen = Application.FileDialog( _
        FileDialogType:=msoFileDialogFolderPicker)
    With dlgOpen
        .AllowMultiSelect = False
        .Title = "Select the unzipped docx folder to extract PDF file(s) from"
        ' .Show returns 0 when the dialog is cancelled.
        If .Show = 0 Then Exit Sub
    End With
    ExtractedZippedDocxFolder = dlgOpen.SelectedItems.Item(1)
    BinFolderPath = ExtractedZippedDocxFolder & "\word\embeddings"

    Set objFSO = CreateObject("Scripting.FileSystemObject")
    If Not objFSO.FolderExists(BinFolderPath) Then
        MsgBox "No word\embeddings folder found in the selected folder"
        Exit Sub
    End If
    Set objFolder = objFSO.GetFolder(BinFolderPath)
    fileIndex = 0
    For Each objFile In objFolder.Files
        If LCase$(Right$(objFile.Name, 4)) = ".bin" Then
            FileNameIndex = Left$(objFile.Name, Len(objFile.Name) - Len(".bin"))
            FileNameBin = BinFolderPath & "\" & FileNameIndex & ".bin"
            FileNamePDF = BinFolderPath & "\" & FileNameIndex & ".pdf"

            ' Read the whole .bin file into a string
            ' (one character per byte on single-byte ANSI code pages).
            hFile = FreeFile
            Open FileNameBin For Binary Access Read As #hFile
            Contents = String(LOF(hFile), vbNullChar)
            Get #hFile, , Contents
            Close #hFile

            ' The PDF runs from the first "%PDF" to the end of the last "%%EOF".
            i = InStr(1, Contents, "%PDF")
            j = InStrRev(Contents, "%%EOF")
            If i > 0 And j > i Then
                PDF = Mid$(Contents, i, j - i + Len("%%EOF"))
                hFile = FreeFile
                Open FileNamePDF For Binary Access Write As #hFile
                Put #hFile, , PDF
                Close #hFile
                fileIndex = fileIndex + 1
            End If
        End If
    Next
    If fileIndex = 0 Then
        MsgBox "Unable to find any PDF in the .bin files of the given unzipped docx folder"
    Else
        MsgBox Str(fileIndex) & " files were processed"
    End If
End Sub
Upvotes: 3