Reputation: 9374
I have a large number of commercial invoices in PDF format. I need to pick out some information from each one, such as the billing party, the transaction date and the amount of money.
In other words, I need to copy this information from each commercial invoice and paste it into an Excel spreadsheet.
The information is always at the same position on every PDF.
Is there a way to have Python pick up the information and store it in an Excel spreadsheet, instead of copying and pasting manually?
Thanks.
Upvotes: 1
Views: 809
Reputation: 8702
To read the PDF text line by line, you can wrap the extracted content in a StringIO object (getPDFContent is defined further down):
    from StringIO import StringIO
    pdfContent = StringIO(getPDFContent("Billineg.pdf").encode("ascii", "ignore"))
    for line in pdfContent:
        print line
Another option is to use pyPdf directly.
A small example:
    from pyPdf import PdfFileReader
    input1 = PdfFileReader(file("Billineg.pdf", "rb"))
    # print the title of Billineg.pdf
    print "title = %s" % (input1.getDocumentInfo().title)
After extracting the data you can write it to a CSV file,
or, for Excel, you can use xlwt.
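As a minimal sketch of the CSV route (written for Python 3, using only the standard csv module), the extracted values could be collected as one dict per invoice and written out as below. The column names and the write_invoices_csv helper are hypothetical, chosen only for illustration; Excel can open the resulting file directly.

```python
import csv

# Hypothetical column names; adjust them to the fields you actually extract.
FIELDS = ["billing_party", "date", "amount"]

def write_invoices_csv(rows, path):
    """Write a list of dicts (one per invoice) to a CSV file with a header row."""
    # newline="" lets the csv module control line endings (Python 3)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_invoices_csv(rows, "invoices.csv") with one dict per invoice would then produce one spreadsheet row per PDF.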
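getPDFContent is a function defined as follows:
    import pyPdf
    def getPDFContent(path):
        content = ""
        p = file(path, "rb")
        pdf = pyPdf.PdfFileReader(p)
        # loop over the actual number of pages instead of assuming 10
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"
        # collapse all whitespace, including non-breaking spaces and newlines, into single spaces
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content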
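Because the fields sit in the same place on every invoice, the text returned by getPDFContent can usually be searched with fixed patterns. The sketch below (Python 3, standard re module) assumes hypothetical labels "Bill To:", "Date:" and "Amount:" and a made-up sample string; the real patterns depend on what extractText() actually returns for your invoices, so print the content first and adapt them.

```python
import re

# Hypothetical patterns: the labels and formats are assumptions,
# not taken from any real invoice layout.
PATTERNS = {
    "billing_party": re.compile(r"Bill To:\s*(.+?)\s+Date:"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "amount": re.compile(r"Amount:\s*([\d,]+\.\d{2})"),
}

def parse_invoice_text(text):
    """Return a dict of field name -> matched string, or None when a field is not found."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result

# Made-up text in the single-line form that getPDFContent produces
sample = "Invoice 001 Bill To: ACME Ltd Date: 2014-01-31 Amount: 1,200.50"
fields = parse_invoice_text(sample)
```

Fields that cannot be found come back as None, which makes it easy to spot invoices whose layout differs from the expected one.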
Upvotes: 2