Reputation: 977
I've been tasked with creating an Excel tool to import data from thousands of PDFs into Excel. I've installed Adobe Acrobat X Professional, and it seems to be working fine. I'm using Excel 2007.
I found sample code here: http://khkonsulting.com/2010/09/reading-pdf-form-fields-with-vba/
And modifying it only slightly to prompt me to browse to the PDF file, and setting a reference to "Adobe Acrobat 10.0 Type Library", I'm trying this code:
Sub Main()
Dim AcroApp As Acrobat.CAcroApp
Dim theForm As Acrobat.CAcroPDDoc
Dim jso As Object
Dim text1, text2 As String
Set AcroApp = CreateObject("AcroExch.App")
Set theForm = CreateObject("AcroExch.PDDoc")
Dim strFullPath As String
strFullPath = Application.GetOpenFilename()
theForm.Open (strFullPath) 'theForm.Open ("C:\temp\sampleForm.pdf")
Set jso = theForm.GetJSObject
' get the information from the form fields Text1 and Text2
text1 = jso.getfield("MFR_ctrl33605579").Value 'jso.getfield("Text1").Value
'(etc)
End Sub
("MFR_ctrl33605579" is the name of one of the text fields; I found that in the editor that is part of or included with Acrobat Pro, called "Adobe LiveCycle Designer".)
It runs without error until it gets to the getfield method. Then I get the error "Object required".
If I run TypeName on AcroApp, theForm, and jso, I get CAcroApp, CAcroPDDoc, and object respectively:
?typename(AcroApp)
CAcroApp
?typename(theForm)
CAcroPDDoc
?typename(jso)
Object
If I put them in the Watches window, I initially get plus signs to the left of all three of them, but if I click on those plus signs, only one line appears under each one, saying "< No Variables >" in the Value column.
I wondered if the document was somehow locked against being read in this way. I read that if it was locked, a padlock would appear at the upper left in Acrobat, and it would not allow saving as text. But no padlock appears that I see, and it does allow saving as plain text. So as far as I can tell it doesn't seem to be locked.
Any suggestions on what I can try?
UPDATE:
I'm just getting started trying the suggestions, but I wanted to note that after trying rheitzman's For loop to get the field name (using getNthFieldName), I see that my code works if I use this for the field name:
form1[0].QuestionnaireForm[0].sbfrmProfile[0].sbfrmContact[0].sbfrmManufacturerDetails[0].MFR_ctrl33605579[0]
Or in other words:
text1 = jso.getfield("MFR_form1[0].QuestionnaireForm[0].sbfrmProfile[0].sbfrmContact[0].sbfrmManufacturerDetails[0].MFR_ctrl33605579[0]").Value
That would allow me to identify fields by that very long, apparently fully-qualified reference, which would get me through the project. But first I'm going to check the other ideas to see if the fields can be found by short name only.
UPDATE 2:
I see now that I can inspect some details of some of the object model in the Object Browser (filter Libraries on Acrobat), even though it doesn't show up in the Watch window.
However, it doesn't show objects created by methods, such as the object created by the AcroExch.PDDoc.GetJSObject method (see AcroPDDoc in the object model). The method of that object used here, getNthFieldName, does not appear in the object browser at all.
And, I don't see any other way there of identifying the fields by their short field names.
So ... for this project, I'm just going to use the long field names returned by getNthFieldName.
Upvotes: 1
Views: 6199
Reputation: 107587
Not quite intuitive in the Adobe Acrobat OLE documentation but connsider using the AvDoc (UI) object as a "middle man" between document and app (Form < UI < App
). Essentially, the AvDoc is the user interface object that the user interacts with while the PDDoc is the background object the user does not see but Adobe Acrobat maintains access to.
Notice too I open the document with formDoc
, a different PDDoc, than the one used for JSObject as formDoc
inherits from the AVDoc object. Again, this automation is not well documented as Adobe leans more to Javascript than VBA, so this is mostly learned through trial and error.
Sub Main()
Dim AcroApp As Acrobat.CAcroApp
Dim formUI As Acrobat.CAcroAVDoc
Dim srcDoc As Acrobat.CAcroPDDoc, formDoc As Acrobat.CAcroPDDoc
Dim jso As Object
Dim strFullPath As String, text1 As String, text2 As String
Set AcroApp = CreateObject("AcroExch.App")
Set formUI = CreateObject("AcroExch.AVDoc")
Set srcDoc = CreateObject("AcroExch.PDDoc")
strFullPath = Application.GetOpenFilename()
srcDoc.Open (strFullPath)
If formUI.Open(strFullPath, "") = True Then
Set formDoc = formUI.GetPDDoc()
Set jso = formDoc.GetJSObject()
' EXTRACT FORM FIELDS' TEXT
text1 = jso.GetField("MFR_ctrl33605579").Value
End If
' UNINTIALIZING PDF OBJECTS
Set jso = Nothing
Set formDoc = Nothing
Set srcDoc = Nothing
Set formUI = Nothing
Set AcroApp = Nothing
End Sub
Upvotes: 0
Reputation: 2297
iTextSharp is much easier to use!
That said, here's a snippet to read all fields.
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
Dim pdfForm As Acrobat.AcroPDDoc
Dim jso As Object ' ?
Dim s As String = ""
Dim i As Integer
Dim strFullPath As String = "H:\OIS\ENFORCE\OPEN_BURN\Ag\temp\Open_Burn_Template_Out.pdf"
Try
pdfForm = New Acrobat.AcroPDDoc
pdfForm.Open(strFullPath)
jso = pdfForm.GetJSObject
For i = 0 To jso.numfields() - 1
s = jso.getNthFieldName(i)
Debug.Print(s & ": " & jso.getField(s).value)
Next
Catch ex As Exception
MsgBox(ex.Message)
End Try
End Sub
It could be your "form" doesn't have any fields (or a flattened form), or the names are off.
The JSObject properties are a bit of a mystery. I found a post that used reflection to extract data but it just so happens that the JSObject properties are usable if you know what they are! e.g. numfields, getNthFieldName
If anyone finds a reference link for JSObject please post in comments.
Upvotes: 2