Shai Alon

Reputation: 990

How to extract embedded PDFs from a Word document on Linux (Mac)

I ran into this problem on a Mac and wanted to share my solution: a bash script that needs no additional applications.

Upvotes: 2

Views: 3242

Answers (1)

Shai Alon

Reputation: 990

This script extracts all the PDF files embedded in a Word document.

Put the script in the same folder as your word.docx file, make it executable (chmod +x extract_docx_objects.sh), and run it like this:

./extract_docx_objects.sh word.docx

The extracted files will be in the subfolder docx_zip/word/embeddings/.

Here's the code:

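docx=$1
echo "$docx"
# Start from a clean working folder (any existing docx_zip is deleted)
rm -rf docx_zip
mkdir -p docx_zip
# A .docx file is a zip archive, so copy it and unzip the copy
cp "$docx" docx_zip/temp.zip
cd docx_zip/ || exit 1
unzip temp.zip
cd word/embeddings/ || exit 1
ls -la *.bin
for f in *.bin
do
    echo "processing $f..."
    fname=${f%.*}
    # xxd -b prints 6 bytes per line, so the line numbers found below give
    # offsets rounded to a 6-byte boundary, and a marker that is split
    # across two lines is not found at all.
    # Line of the first %PDF header
    start=$(xxd -b "$f" | grep -n '%PDF' | head -n 1 | awk -F: '{print $1}')
    # Line of the last %%EOF trailer
    end=$(xxd -b "$f" | grep -n '%%EOF' | tail -n 1 | awk -F: '{print $1}')
    if [ -z "$start" ] || [ -z "$end" ]; then
        echo "no PDF markers found in $f, skipping"
        continue
    fi
    start1=$(((start-1)*6))
    end1=$(((end-1)*6+5*2))
    # Copy only the bytes between the two offsets
    dd skip=$start1 count=$((end1-start1)) if="$f" of="$fname.pdf" bs=1
done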
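
The script above locates the markers through line numbers in the xxd -b output, so its offsets are only accurate to a 6-byte line. As an alternative, here is a minimal sketch that uses exact byte offsets instead. It assumes a grep that supports -a, -b and -o together (GNU grep does; I have not checked every other grep), and extract_pdf is just a helper name made up for this example:

```shell
# Sketch: carve a PDF out of an embedded .bin object using exact byte offsets.
# Assumption: grep supports -a, -b and -o together (GNU grep does).
# extract_pdf is a hypothetical helper, not part of the original script.
extract_pdf() {
    local bin=$1 out=$2 start end
    # 0-based byte offset of the first "%PDF" header
    start=$(LC_ALL=C grep -abo '%PDF' "$bin" | head -n 1 | cut -d: -f1)
    # 0-based byte offset of the last "%%EOF" trailer
    end=$(LC_ALL=C grep -abo '%%EOF' "$bin" | tail -n 1 | cut -d: -f1)
    if [ -z "$start" ] || [ -z "$end" ] || [ "$end" -lt "$start" ]; then
        echo "no PDF found in $bin" >&2
        return 1
    fi
    # Copy from the header through the 5 bytes of "%%EOF"
    tail -c +"$((start + 1))" "$bin" | head -c "$((end + 5 - start))" > "$out"
}
```

Inside the loop it would be called as extract_pdf "$f" "$fname.pdf" in place of the xxd and dd lines.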

Note that the script deletes any existing docx_zip folder without asking. You may want to add a check for an existing folder before deleting it (I didn't here).
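
One possible way to add that check is sketched below. prepare_workdir is a hypothetical helper name, and refusing to continue (rather than asking for confirmation) is just one choice:

```shell
# Sketch: create the working folder only if it does not exist yet.
# prepare_workdir is a hypothetical helper; docx_zip is the folder the script uses.
prepare_workdir() {
    local dir=${1:-docx_zip}
    if [ -e "$dir" ]; then
        echo "$dir already exists; remove it or choose another folder" >&2
        return 1
    fi
    mkdir -p "$dir"
}
```

In the script, prepare_workdir docx_zip || exit 1 would replace the rm -rf and mkdir -p lines.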

Enjoy!



If you need a VBA macro in Windows to do the same, here's my solution:

There is a partial solution in VBA, and it needs preparation before you can run it:

  1. Copy/rename your Word.docx file to Word.zip
  2. With your zip software, extract Word.zip (to the same folder or another one)
  3. Run the VBA macro from any Word document - it will ask you for the location of the unzipped folder.
  4. Once done, the PDF files will be located in the Word/word/embeddings subfolder.

The VBA macro:

Sub export_PDFs()
    Dim Contents As String
    Dim PDF As String
    Dim hFile As Integer
    Dim i As Long, j As Long
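    Dim ExtractedZippedDocxFolder As String, FileNameBin As String
    Dim FileNamePDF As String, BinFolderPath As String, FileNameIndex As String
    Dim objFSO As Object, objFolder As Object, objFile As Object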
    Dim fileIndex As Integer
   
    Dim dlgOpen As FileDialog
    Set dlgOpen = Application.FileDialog( _
    FileDialogType:=msoFileDialogFolderPicker)
    With dlgOpen
        .AllowMultiSelect = False
        .Title = "Select the unzipped docx folder to extract PDF file(s) from"
        .InitialFileName = "*.docx"
        If .Show <> -1 Then Exit Sub   ' the user cancelled the dialog
    End With
    ExtractedZippedDocxFolder = dlgOpen.SelectedItems.Item(1)
    BinFolderPath = ExtractedZippedDocxFolder + "\word\embeddings"
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    Set objFolder = objFSO.GetFolder(BinFolderPath)
    fileIndex = 0
   
    For Each objFile In objFolder.Files
        If LCase$(Right$(objFile.Name, 4)) = ".bin" Then
            FileNameIndex = Left$(objFile.Name, Len(objFile.Name) - Len(".bin"))
            FileNameBin = BinFolderPath + "\" + FileNameIndex + ".bin"
            FileNamePDF = BinFolderPath + "\" + FileNameIndex + ".pdf"
       
            hFile = FreeFile
            Open FileNameBin For Binary Access Read As #hFile
            Contents = String(LOF(hFile), vbNullChar)
            Get #hFile, , Contents
            Close #hFile
       
            i = InStrB(1, Contents, "%PDF")
            j = InStrB(i, Contents, "%%EOF")
            If (InStrB(j + 1, Contents, "%%EOF") > 0) Then j = InStrB(j + 1, Contents, "%%EOF")
       
            PDF = MidB(Contents, i, j + 5 - i + 12)
       
            Open FileNamePDF For Binary Access Write As #hFile
            Put #hFile, , PDF
            Close #hFile
            fileIndex = fileIndex + 1
        End If
    Next
    If fileIndex = 0 Then
        MsgBox "Unable to find any .bin file in the given unzipped docx folder"
    Else
        MsgBox CStr(fileIndex) & " files were processed"
    End If

End Sub
   

Upvotes: 3
