Nick
Nick

Reputation: 875

Convert PDF to text file using VBA and Adobe Acrobat XI standard

Part 3 of a previous post.

The task: I am attempting to iterate over a series of URLs presented in excel and generate complete text files for each.

So far: The VBA solution in the previous post uses Word to open the PDF Url, and from this generate a text file. This is a much better improvement than using power query. However, I have come across one instance where it fails as it recognises the text on the first page as an image; therefore, this text is omitted in the generated text file.

As an example of a truncated text file generated: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b is an example where the generated PDF text file has text missing.

As advised, I have downloaded a version of Adobe Acrobat (Adobe Acrobat XI Standard), which enables Adobe Acrobat 10.0 Type Library Reference (and a bunch of others). I hope this will enable more accurate text conversion but will need help to develop this script.

So, for this problem, I wish to modify the following code to swap Word for Abode to perform the conversion. What I am unsure of, however, is whether Adobe can open a PDF from a URL in the same way word can. I think I may need to introduce some steps to download the PDF first in the folder and then for each convert those. Ideally, It would work as the above code but if this is necessary, then so be it.

M Code from the previous answer:

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(url)
            On Error GoTo 0      'stop ignoring errors
            If Not oDoc Is Nothing Then
                filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                Debug.Print filePath
                'open text stream as unicode
                Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                fileStream.Write oDoc.Range.Text
                fileStream.Close
                oDoc.Close
                c.Interior.Color = vbGreen 'flag OK
            Else
                c.Interior.Color = vbRed   'flag problem
            End If
        End If 'have url
    Next c
    
    oWd.Quit
End Sub

As for developing this script, I have the following from what I have found online, which now successfully performs the conversion using Acrobat for the previously difficult PDF so Good start.

Sub convertpdf2()

    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim Filename As String
    Dim jsObj As Object
    Dim NewFileName As String

    Filename = "C:\temp\name1.pdf"
    NewFileName = "C:\temp\name1.txt"

    Set AcroXApp = CreateObject("AcroExch.App")
    'AcroXApp.Show

    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open Filename, "Acrobat"

    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc


    Set jsObj = AcroXPDDoc.GetJSObject


    jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"


    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit

    End Sub

Where name1 is a downloaded PDF.

So, in essence I am trying to:

I think if I can achieve this, then I will be able to work out how to merge this with the previous post so that it can loop through a given number of URLs to generate the text files as well as handle the non-unicofe characters.

Will update if I make any progress

Setup in Excel: enter image description here

URLs:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

Update: Attempting to add Error handling Highlighting

M Code:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Sub Tester1()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String
    
    
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object

    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("C2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            
            Debug.Print pdfPath, c.Value
            
            DownloadFile url, pdfPath
            
            
            Set AcroXApp = CreateObject("AcroExch.App")
            Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
            AcroXAVDoc.Open pdfPath, "Acrobat" & c.Offset(0, -1).Value & ".txt"
            Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        
            If Error <> 0 Then
            
            c.Interior.Color = vbRed   'flag problem
            
            Else
    
            Set jsObj = AcroXPDDoc.GetJSObject
            jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
            AcroXAVDoc.Close False
            AcroXApp.Hide
            AcroXApp.Exit
            
            c.Interior.Color = vbGreen 'flag OK
            
            End If
             
            
        End If 'have url
        
        
    Next c


    kill pdfPath ' toggle on/off to keep PDF


End Sub

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Upvotes: 0

Views: 4230

Answers (2)

Tim Williams
Tim Williams

Reputation: 166835

Tested:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Sub Tester()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String, success As Boolean
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            DownloadFile url, pdfPath
            
            If IsErrorPage(pdfPath) Then
                c.Interior.Color = vbRed
                c.Offset(0, 1).Value = "No PDF returned" 'flag in col C
            Else
                success = ConvertPdf2(pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt")
                c.Interior.Color = IIf(success, vbGreen, vbRed)
                c.Offset(0, 1).Value = IIf(success, "OK", "PDF not openable")
            End If
            
        End If 'have url
    Next c
End Sub

'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
    If success Then
        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

'is the supposed PDF actually an HTML error page?
Function IsErrorPage(path) As Boolean
    Dim txt
    txt = CreateObject("scripting.filesystemobject"). _
                  OpenTextFile(path, 1).ReadAll()
    IsErrorPage = InStr(1, txt, "<!DOCTYPE", vbTextCompare) > 0
End Function

Upvotes: 2

Gove
Gove

Reputation: 1814

I'm not in a position to help with automating acrobat, but here is a procedure I use to download PDFs. Supply a URL to a PDF file and a local path to save the file. Only works on the windows version of Excel.

Sub saveFile(url As String, Optional path As String)

    Dim http As Object
    Dim objfso As Object
    Dim objADOStream As Object
    
    If path = "" Then path = ThisWorkbook.path & "\file.pdf"
    
    Set http = CreateObject("MSXML2.serverXMLHTTP")
    http.Open "GET", url, False
    http.Send
 
    If http.Status = 200 Then
      Set objADOStream = CreateObject("ADODB.Stream")
      objADOStream.Open
      objADOStream.Type = 1 'adTypeBinary
 
      objADOStream.Write http.responsebody
      objADOStream.Position = 0    'Set the stream position to the start
 
      Set objfso = CreateObject("Scripting.FileSystemObject")
        If objfso.fileexists(path) Then objfso.DeleteFile path
      Set objfso = Nothing
 
      objADOStream.SaveToFile path
      objADOStream.Close
      Set objADOStream = Nothing
    End If
 
End Sub

Upvotes: 1

Related Questions