Reputation: 52338
I'm using Adobe Acrobat Pro to extract information from PDFs in XML format. Acrobat does this particularly well. I want to extract information from about a thousand documents and do stuff with that information, so using Acrobat by hand would be annoying. Are there plugins to call Acrobat functions (e.g. Save As XML) from any common language, ideally Python?
Upvotes: 4
Views: 6933
Reputation:
Maybe you could take a look at pyPdf? It lets you read and manipulate PDF files from Python. PDFMiner can also extract PDF content as XML. I know Perl can do it too, because I have used the CAM::PDF module for this myself.
Example:
from pyPdf import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(file("document1.pdf", "rb"))
# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)
# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))
# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))
# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))
# add page 4 from input1, but first add a watermark from another pdf:
page4 = input1.getPage(3)
watermark = PdfFileReader(file("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
    page5.mediaBox.getUpperRight_x() / 2,
    page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)
# print how many pages input1 has:
print "document1.pdf has %s pages." % input1.getNumPages()
# finally, write "output" to document-output.pdf
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
Also take a look at this question: "python and pyPdf - how to extract text from the pages so that there are spaces between lines". It discusses XML parsing and related topics for PDFs.
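For the batch part of the question (about a thousand files), the PDFMiner route mentioned above might be sketched roughly as follows. This assumes the pdfminer.six package and Python 3; the folder names and the helper names `xml_path_for` and `convert_folder` are made up for illustration. Note that PDFMiner's XML layout is its own format and will not match what Acrobat's "save as XML" produces.

```python
from pathlib import Path


def xml_path_for(pdf_path, out_dir):
    """Map some/dir/report.pdf to out_dir/report.xml (hypothetical helper)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".xml")


def convert_folder(in_dir, out_dir):
    """Write one XML file per PDF found directly inside in_dir."""
    # Imported lazily; assumes `pip install pdfminer.six`.
    from pdfminer.high_level import extract_text_to_fp

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        target = xml_path_for(pdf_path, out_dir)
        # The XML converter writes encoded bytes, so open the output as binary.
        with open(pdf_path, "rb") as src, open(target, "wb") as dst:
            extract_text_to_fp(src, dst, output_type="xml", codec="utf-8")
        print("wrote", target)


if __name__ == "__main__":
    # "pdfs" and "xml_out" are placeholder folder names.
    convert_folder("pdfs", "xml_out")
```

Each resulting XML file can then be parsed with any XML library for the "do stuff with that information" step.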
Upvotes: 1
Reputation: 50210
If you're on Windows, you can talk to Acrobat using DDE commands. The pyWin32 module supports DDE calls, or you could try your luck with this stand-alone binding.
But you'll have to figure out which request to send to Acrobat (here's some random documentation, but it doesn't mention XML). The commands seem to change from version to version, or at least some things break, so keep an eye on the version. Good luck.
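A rough sketch of what the DDE route could look like with pyWin32, under several assumptions: the service name "acroview" and topic "control" are what older Acrobat versions register (newer ones reportedly use a versioned service name), `DocOpen` and `DocClose` are the only messages used here, and the helper names are invented for illustration. No message for XML export is shown because, as said above, the documentation doesn't mention one.

```python
def acrobat_command(name, *args):
    """Format a DDE message such as [DocOpen("C:\\docs\\a.pdf")]."""
    rendered = ",".join(
        '"%s"' % arg if isinstance(arg, str) else str(arg) for arg in args
    )
    return "[%s(%s)]" % (name, rendered)


def send_to_acrobat(commands, service="acroview", topic="control"):
    """Send DDE messages to a running Acrobat instance (Windows only)."""
    # Imported lazily so acrobat_command() stays usable on any platform.
    # win32ui has to be loaded before the dde module.
    import win32ui  # noqa: F401
    import dde

    server = dde.CreateServer()
    server.Create("PdfBatchClient")  # arbitrary client name (assumption)
    conversation = dde.CreateConversation(server)
    # Service and topic names are assumptions; check them for your version.
    conversation.ConnectTo(service, topic)
    for command in commands:
        conversation.Exec(command)


if __name__ == "__main__":
    pdf = r"C:\docs\document1.pdf"  # placeholder path
    send_to_acrobat([
        acrobat_command("DocOpen", pdf),
        acrobat_command("DocClose", pdf),
    ])
```

Acrobat generally needs to be running already for the connection to succeed, and the set of accepted messages should be checked against the documentation for the installed version.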
Upvotes: 2