Reputation: 5206
I'm making a web interface to autofill pdf forms with user data from a database. The admin needs to be able to upload a pdf (right now targeted at IRS pdf forms) and then associate the fields in the pdf with data fields in the database.
I need a way to help the admin associate the field names (stuff like "topmostSubform[0].Page2[0].p2-t66[0]") with the the data fields in the database. I'm looking for a way to modify the PDF programatically to in some way provide this information.
Basically I'm open to suggestions on how I might make the field names appear in an obvious manner on a modified version of the original pdf. The closest I've gotten is being able to insert Tooltips into the fields in the pdf by just editting the raw pdf line by line. However when editting the pdf in this manner the field names are gibberish, and so I can't just use them.
An optimal solution would be anything that could automatically parse a pdf and set each field's tooltip to be the fields name. Anything that can be run from the command line, or any python tool, or just a basic how to correctly parse a field's name from a raw pdf file would be amazing.
Upvotes: 9
Views: 4950
Reputation: 11877
Government Forms are usually not a standard PDF but a JavaScript driven XFA in a PDF wrapper, thus to enter the data programmatically you need a lookup table as the order is rarely the visual order.
Here the first field "single yes or no" "topmostSubform[0].Page1[0].c1_01[0]" is a checkbox designated well down the list of entries. of course none in this Form are "topmostSubform[0].Page2[0].p2-t66[0]" so you need totally different look-up table for each XFA. Otherwise follow the entries (luckilly there is some sence of sequencing in this form) so free format field "topmostSubform[0].Page1[0].f1_01[0]" is near "dependent:" etc.
There are XFA dedicated applications that can extract the positions of static fields, but if the fields are dynamically adapting then the page position would be a moving feast.
For XFA you need an intelligent dedicated Adobe listing (often xml / xlsx input output supplied on request from the relevant department), or build your own if Acrobat Pro does not block the attempt.
Upvotes: 0
Reputation: 119
The SDAPS framework was designed for scenarios like this: It aids in batch-processing PDF-based forms, extract contents from designated fields and e.g. funnel those into a database for further processing.
Upvotes: -1
Reputation: 1987
This may be way off your intended track; but, it might be worth a think. I've been working on parsing scanned structured documents into Django model instances. Using tesseract and unpaper to do the pre-processing and OCR, I get over 99% accuracy. That lets me parse the OCR output text with the Levenshtein
and re
modules and do a simple new_instance = MyModel(parsed1, parsed2, ...)
.
It seems that you are trying to do something similar. Looking at the forms at http://www.irs.gov/formspubs/ They tend to have text labels left-adjacent to the fields. Using something like py-tesseract, you should be able to OCR the labels, overlay the OCR text over the form image and allow the user to select/edit the field labels.
There is a nice little tool, ocrfeeder
https://live.gnome.org/OCRFeeder, that is written in python and should give you a basic idea of how the process works in a desktop app. Good luck.
Upvotes: 0
Reputation: 1987
A postscript parser lives here: https://github.com/haxwithaxe/py-ps-parser
I've been interested in playing with it, but haven't yet.
Upvotes: 0
Reputation: 513
I may be interpreting the question wrong but I have a lot of experience in pdf generation with python/django because of the site that I worked on for 5 months. I would suggest using texlive. Basically what I did was built a generic tex template for a document and then used django templating to insert the fields. I rendered the template as if it were html using render_to_string and then generated it using the pdflatex command. I ran pdflatex using pythons subprocess module and a little extra. To do the generating I used this guys pdflatex module http://bit.ly/KaDMBp , with some modifications. All the things you need are in the core.py inside of the pdflatex directory.
Ex tex document (test.tex) )
\begin{document}
my name is {{input_name}} and i live in {{input_location}}.
\end{document}
Ex rendering template with django templating and render_to_string )
params={input_name:"andrew",input_location:"nyc"}
tex_doc = render_to_string('test.tex', params)
Ex generating as pdf)
pdflatex = PDFLatex(texfile=tex_path,outputdir=pdf_path)
pdflatex.transform()
Latex has a somewhat annoying, difficult learning curve but if you put in the time you can learn what you need to know in order to create these pdfs.
Hope this helps.
Upvotes: 0
Reputation: 1961
There may be an easier solution than this, but you could definitely get the job done with http://www.reportlab.com/software/opensource/rl-toolkit/'>ReportLab.
If you can save the current tax forms as an image, you could determine where each of the items need to be written and develop your code so that it automatically layers the appropriate values from the database on top of the image (the tax form, or whatever it might be).
Once you've determined 1) What fields need to be pulled from the database, and 2) where they 're supposed to go on within the form...
this is essentially what you'd be doing:
from reportlab.pdfgen import canvas
report_string_values = ['Alex',500,500],['Guido',400,400],
c = canvas.Canvas('hello.pdf')
c.drawImage(background_image,x_pos,y_pos) # x_pos and w_pos are # pixels from bl origin
for rsv in report_string_values:
c.drawString(rsv.x_pos,rsv.,rsv.text)
c.showPage()
c.save()
Upvotes: 0