Convert PDF to text file using VBA and Adobe Acrobat XI standard

Question

The task: I am attempting to iterate over a series of URLs presented in excel and generate complete text files for each.

So far: The VBA solution in the previous post uses Word to open the PDF Url, and from this generate a text file. This is a much better improvement than using power query. However, I have come across one instance where it fails as it recognises the text on the first page as an image; therefore, this text is omitted in the generated text file.

As an example of a truncated text file generated: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b is an example where the generated PDF text file has text missing.

As advised, I have downloaded a version of Adobe Acrobat (Adobe Acrobat XI Standard), which enables Adobe Acrobat 10.0 Type Library Reference (and a bunch of others). I hope this will enable more accurate text conversion but will need help to develop this script.

So, for this problem, I wish to modify the following code to swap Word for Abode to perform the conversion. What I am unsure of, however, is whether Adobe can open a PDF from a URL in the same way word can. I think I may need to introduce some steps to download the PDF first in the folder and then for each convert those. Ideally, It would work as the above code but if this is necessary, then so be it.

M Code from the previous answer:

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(url)
            On Error GoTo 0      'stop ignoring errors
            If Not oDoc Is Nothing Then
                filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                Debug.Print filePath
                'open text stream as unicode
                Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                fileStream.Write oDoc.Range.Text
                fileStream.Close
                oDoc.Close
                c.Interior.Color = vbGreen 'flag OK
            Else
                c.Interior.Color = vbRed   'flag problem
            End If
        End If 'have url
    Next c
    
    oWd.Quit
End Sub

As for developing this script, I have the following from what I have found online, which now successfully performs the conversion using Acrobat for the previously difficult PDF so Good start.

Sub convertpdf2()

    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim Filename As String
    Dim jsObj As Object
    Dim NewFileName As String

    Filename = "C:	emp
ame1.pdf"
    NewFileName = "C:	emp
ame1.txt"

    Set AcroXApp = CreateObject("AcroExch.App")
    'AcroXApp.Show

    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open Filename, "Acrobat"

    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc


    Set jsObj = AcroXPDDoc.GetJSObject


    jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"


    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit

    End Sub

Where name1 is a downloaded PDF.

So, in essence I am trying to:

Download the PDF from a List of URLs/Open the PDF to a specified folder (if necessary)
use VBA to convert this to text

I think if I can achieve this, then I will be able to work out how to merge this with the previous post so that it can loop through a given number of URLs to generate the text files as well as handle the non-unicofe characters.

Will update if I make any progress

Setup in Excel:

URLs:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

Update: Attempting to add Error handling Highlighting

M Code:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Sub Tester1()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String
    
    
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object

    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("C2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            
            Debug.Print pdfPath, c.Value
            
            DownloadFile url, pdfPath
            
            
            Set AcroXApp = CreateObject("AcroExch.App")
            Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
            AcroXAVDoc.Open pdfPath, "Acrobat" & c.Offset(0, -1).Value & ".txt"
            Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        
            If Error <> 0 Then
            
            c.Interior.Color = vbRed   'flag problem
            
            Else
    
            Set jsObj = AcroXPDDoc.GetJSObject
            jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
            AcroXAVDoc.Close False
            AcroXApp.Hide
            AcroXApp.Exit
            
            c.Interior.Color = vbGreen 'flag OK
            
            End If
             
            
        End If 'have url
        
        
    Next c


    kill pdfPath ' toggle on/off to keep PDF


End Sub

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Tim Williams · Accepted Answer

Tested:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Sub Tester()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String, success As Boolean
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            DownloadFile url, pdfPath
            
            If IsErrorPage(pdfPath) Then
                c.Interior.Color = vbRed
                c.Offset(0, 1).Value = "No PDF returned" 'flag in col C
            Else
                success = ConvertPdf2(pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt")
                c.Interior.Color = IIf(success, vbGreen, vbRed)
                c.Offset(0, 1).Value = IIf(success, "OK", "PDF not openable")
            End If
            
        End If 'have url
    Next c
End Sub

'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then
        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

'is the supposed PDF actually an HTML error page?
Function IsErrorPage(path) As Boolean
    Dim txt
    txt = CreateObject("scripting.filesystemobject"). _
                  OpenTextFile(path, 1).ReadAll()
    IsErrorPage = InStr(1, txt, " 0
End Function

Convert PDF to text file using VBA and Adobe Acrobat XI standard

Answers (2)

Related Questions