Nick
Nick

Reputation: 775

Extracting whole Sentences from PDFs (as best as possible) - Plain Text From PDF without inserting line breaks

I believe I have finally come up with a way to extract plain text without line breaks whilst retaining intended carriage returns from PDFs using VBA, Acrobat and Word Combined.

Previous answers using either word or acrobat independently ran into their own issues. Word would occasionally omit text interpreted as images, and Acrobat sometimes would not handle complex structures of PDFs and generate a blank text file.

Having tinkered with word, I realise that it has the option to generate plain text without linebreaks as shown below. Importantly the text generated retains intended carriage returns.

enter image description here

Acrobat does this automatically, too, when generating a plain text file; however, with the issue of unstructured PDFs, I think word is the better bet. And also likely more controllably with VBA.

By combining the two in VBA, I believe I have omitted many of the issues. The text files generated are much more than what I have been after for the past few days. i.e. sentences are not broken with line breaks.

The VBA code below works as follows:

  1. Convert all PDFs contained within a folder to word (using acrobat to ensure no part of the PDF is omitted)
  2. Use words to achieve the conversion to plain text.

Update: 21/12/22 The below code uses FileFormat:=wdFormatText which maybe more straight forward.

Sub ConvertDocumentsToTxt()
'Updated by Extendoffice 20181123
    Dim xIndex As Long
    Dim xFolder As Variant
    Dim xFileStr As String
    Dim xFilePath As String
    Dim xDlg As FileDialog
    Dim xActPath As String
    Dim xDoc As Document
    Application.ScreenUpdating = False
    Set xDlg = Application.FileDialog(msoFileDialogFolderPicker)
    If xDlg.Show <> -1 Then Exit Sub
    xFolder = xDlg.SelectedItems(1)
    xFileStr = Dir(xFolder & "\*.doc")
    xActPath = ActiveDocument.Path
    While xFileStr <> ""
        xFilePath = xFolder & "\" & xFileStr
        If xFilePath <> xActPath Then
            Set xDoc = Documents.Open(xFilePath, AddToRecentFiles:=False, Visible:=False)
            xIndex = InStrRev(xFilePath, ".")
            Debug.Print Left(xFilePath, xIndex - 1) & ".txt"
            xDoc.SaveAs Left(xFilePath, xIndex - 1) & ".txt", FileFormat:=wdFormatText, AddToRecentFiles:=False
            xDoc.Close True
        End If
        xFileStr = Dir()
    Wend
    Application.ScreenUpdating = True
End Sub

So far: (Updated now improved - Same as submitted answer) I have created the following working script in VBA, which achieves these two steps:

References, Acrobat, and Microsoft Scripting Runtime.

Sub LoopThroughFiles()
    
    Dim StrFile As String
    Dim pdfPath As String
    
    StrFile = Dir("C:\temp\PDFs\")
    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    Do While Len(StrFile) > 0
        
        Debug.Print StrFile
        n = StrFile
        pdfPath = fileRoot & StrFile
        
        Debug.Print pdfPath
        
    'Convert to WordDoc
    success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".doc")
    StrFile = Dir
    On Error Resume Next
        
    oWd.Quit
        
    'Convert to PlainText
    Debug.Print pdfPath & ".doc"

    success2 = GetTextFromWord(pdfPath & ".doc", n)
    
Loop
End Sub

'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then
    
Application.Wait (Now + TimeValue("0:00:2")) 'Helps PC have some time to go through data, can cause PC to freeze without

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

Function GetTextFromWord(DocStr As String, n)

    Dim filePath As String
    Dim fso As FileSystemObject
    Dim fileStream As TextStream
    Dim oWd As Object, oDoc As Object, fileRoot As String
    Const wdFormatText As Long = 2, wdCRLF As Long = 0
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    fileRoot = "C:\temp\PDFs" 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(DocStr)
            On Error GoTo 0      'stop ignoring errors
            
            Debug.Print n
            If Not oDoc Is Nothing Then
                filePath = fileRoot & n & ".txt"  'filename
                Debug.Print filePath
                
                
        oDoc.SaveAs2 Filename:=filePath, _
        FileFormat:=wdFormatText, LockComments:=False, Password:="", _
        AddToRecentFiles:=True, WritePassword:="", ReadOnlyRecommended:=False, _
        EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
        :=False, SaveAsAOCELetter:=False, Encoding:=1252, InsertLineBreaks:=False _
        , AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0
        
    oDoc.Close False
    
    End If
    oWd.Quit
                
   
    GetTextFromWord = success2
    
End Function

Please note I am not good at all with VBA; much of this is stitching together answers previously provided and trying to get it to loop through. I am hoping someone with VBA experience can review this and really make it more robust.

It does work, albeit quite slowly, to generate the doc files and then text files:

enter image description here

I hope someone familiar with VBA can help me make this solution more robust.

The files can be downloaded here: https://1drv.ms/u/s!AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=2b6DNg

Add to a PDF folder in temp, and the code should run okay.

Please let me know if you require any more information. I think this is it after a week of questions. Just the code needs tidying up.

Finally, if anyone who comes across this knows of any program that can generate plain text without inserting line breaks but retaining carriage returns, please let me know. Acrobat would be the solution and does work for most cases but has to generate tags on some PDFs, which has failed in my case. I would be very interested in an existing program that can in Batch convert PDFs in this way.

Upvotes: 0

Views: 957

Answers (2)

Daniel Sanders
Daniel Sanders

Reputation: 1

Try using below:

  strTemp = Replace(FromString, vbCr, " ")


 strTemp = Replace(strTemp, vbLf, " ")
 strTemp = Replace(strTemp, vbNewline," ")

I use the free tool xpf reader to convert a pdf.

Upvotes: 0

Nick
Nick

Reputation: 775

Improved Answer that enables word parameters

ChangeEncoding:=1252 to 65001 for unusual characters(Added below):

Sub LoopThroughFiles()
    
    Dim StrFile As String
    Dim pdfPath As String
    
    StrFile = Dir("C:\temp\PDFs\")
    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    Do While Len(StrFile) > 0
        
        Debug.Print StrFile
        n = StrFile
        pdfPath = fileRoot & StrFile
        
        Debug.Print pdfPath
        
    'Convert to WordDoc
    success = ConvertPdf2(pdfPath, fileRoot & StrFile & ".doc")
    StrFile = Dir
    On Error Resume Next
        
    oWd.Quit
        
    'Convert to PlainText
    Debug.Print pdfPath & ".doc"

    success2 = GetTextFromWord(pdfPath & ".doc", n)
    
Loop
End Sub

'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then
    
Application.Wait (Now + TimeValue("0:00:2")) 'Helps PC have some time to go through data, can cause PC to freeze without

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.doc"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

Function GetTextFromWord(DocStr As String, n)

    Dim filePath As String
    Dim fso As FileSystemObject
    Dim fileStream As TextStream
    Dim oWd As Object, oDoc As Object, fileRoot As String
    Const wdFormatText As Long = 2, wdCRLF As Long = 0
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    fileRoot = "C:\temp\PDFs" 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(DocStr)
            On Error GoTo 0      'stop ignoring errors
            
            Debug.Print n
            If Not oDoc Is Nothing Then
                filePath = fileRoot & n & ".txt"  'filename
                Debug.Print filePath
                
                
        oDoc.SaveAs2 Filename:=filePath, _
        FileFormat:=wdFormatText, LockComments:=False, Password:="", _
        AddToRecentFiles:=False, WritePassword:="", ReadOnlyRecommended:=False, _
        EmbedTrueTypeFonts:=False, SaveNativePictureFormat:=False, SaveFormsData _
        :=False, SaveAsAOCELetter:=False, Encoding:=65001, InsertLineBreaks:=False _
        , AllowSubstitutions:=True, LineEnding:=wdCRLF, CompatibilityMode:=0
        
    oDoc.Close False
    
    End If
    oWd.Quit
                
   
    GetTextFromWord = success2
    
End Function

Upvotes: 1

Related Questions