How to use Python to extract text from PDF documents

Question

I have a large number of commercial invoices in PDF format. I need to pick out some information from each one, such as the billing party, the transaction date, and the amount.

In other words, I need to copy this information from each commercial invoice and paste it into an Excel spreadsheet.

This information is always at the same position on every PDF.

Is there a way to have Python pick up the information and store it in an Excel spreadsheet, instead of copying and pasting manually?

Thanks.

sundar nataraj · Accepted Answer

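To read the text of the PDF file line by line, you can wrap the extracted content in a StringIO object (getPDFContent is the helper defined further down):

from StringIO import StringIO  # Python 2 module

# getPDFContent collapses all whitespace, so the content arrives as a single line
pdfContent = StringIO(getPDFContent("Billineg.pdf").encode("ascii", "ignore"))
for line in pdfContent:
    print line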

Another option is to use pyPdf directly. A small example:

from pyPdf import PdfFileReader

input1 = PdfFileReader(open("Billineg.pdf", "rb"))
# print the title of Billineg.pdf
print "title = %s" % (input1.getDocumentInfo().title)

After extracting the data, you can write it to a CSV file, or use xlwt to write an Excel file directly.
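
As a rough sketch of the CSV route, the snippet below writes one row per invoice using only the standard csv module; Excel can open the resulting file. It is written for Python 3, and the column names, the sample row, and the rows_to_csv helper are illustrative assumptions, not something taken from the invoices in the question.

```python
import csv
import io

# Hypothetical column names for the fields mentioned in the question.
FIELDS = ["billing_party", "transaction_date", "amount"]

def rows_to_csv(rows, out):
    """Write a header plus one CSV row per invoice dict to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Made-up sample data standing in for values extracted from the PDFs.
sample = [
    {"billing_party": "ACME Ltd", "transaction_date": "2014-05-01", "amount": "120.50"},
]

buf = io.StringIO()
rows_to_csv(sample, buf)
csv_text = buf.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with open("invoices.csv", "w", newline="") as the second argument (the file name is only an example).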

getPDFContent is a helper function:

import pyPdf

def getPDFContent(path):
    content = ""
    with open(path, "rb") as p:
        pdf = pyPdf.PdfFileReader(p)
        # loop over the actual page count instead of a hard-coded 10
        for i in range(pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"
    # replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
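
Once getPDFContent has returned the text, the individual fields still have to be picked out of it. One possible approach, sketched here for Python 3, is a set of regular expressions keyed on labels that appear next to each value. The labels "Bill To:", "Date:" and "Total:", the sample text, and the parse_invoice helper are all assumptions for illustration; the real invoices will need their own patterns.

```python
import re

# Hypothetical patterns: the labels below are assumptions about the invoice
# layout and must be adapted to the real documents.
PATTERNS = {
    "billing_party": re.compile(r"Bill To:\s*(.+?)\s+Date:"),
    "transaction_date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "amount": re.compile(r"Total:\s*([\d.,]+)"),
}

def parse_invoice(text):
    """Return a dict of the fields found in text; missing fields map to None."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result

# Made-up text in the single-line form that getPDFContent returns.
sample_text = "Invoice 001 Bill To: ACME Ltd Date: 2014-05-01 Total: 120.50"
fields = parse_invoice(sample_text)
```

Because every invoice has the same layout, the same patterns can be applied to each file in a loop, with the resulting dicts collected into a list for writing out.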
