Reputation: 38
Hoping for some help, as I can't find a solution.
We currently have a lot of manual data entry done by people reading PDF files, and I have been asked to find a way to cut this time down. My plan is to convert each PDF to a more easily readable format, then use grep to strip out the standard fields, leaving just the data behind. The result would then be loaded into a template, and from there into SAP.
However, the main problem has come at the first hurdle: converting the PDF into a txt file. The code I use is as follows:
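The field-stripping step could be sketched roughly as below, in Python rather than grep. This is only an illustration under assumptions: the label list, the `strip_labels` name, and the sample values are made up, and it assumes the extracted text alternates label, value, label, value.

```python
import re

# Hypothetical label list; the real one would come from the form template.
LABELS = ["*Title:", "*Legal First Name (s):", "*Legal Surname:"]


def strip_labels(text, labels=LABELS):
    """Split the text on the known labels and return {label: value}."""
    # Try longer labels first so a short label never shadows a longer one.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(label) for label in ordered) + ")"
    # With a capturing group, re.split returns
    # [text_before, label, value, label, value, ...].
    parts = re.split(pattern, text)
    fields = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        fields[label] = value.strip()
    return fields


# Invented sample input, not real output from the PDF.
sample = "*Title: Mr *Legal First Name (s): John Paul *Legal Surname: Smith"
fields = strip_labels(sample)
```

This only helps once the values actually appear in the extracted text, which is the part that is failing for me.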
import sys
import pyPdf

def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

f = open('test.txt', 'w+')
f.write(getPDFContent("Adminform.pdf").encode("ascii", "ignore"))
f.close()
This works, but it ignores some of the data in the PDF files. To show what I mean, take this PDF page:
http://s23.postimg.org/6dqykomqj/error.png
The first section (gender, title, name) produces the output below:
*Title: *Legal First Name (s): *Your forename and second name (if applicable) as it appears on your passport or birth certificate. Address: *Legal Surname: *Your surname as it appears on your passport or birth certificate
Basically, the actual data that I want to capture is not being converted.
Anyone have a fix for this?
Thanks,
Upvotes: 0
Views: 1349
Reputation: 61
Generally speaking, converting PDFs to text is a bad idea. The result is almost always messy.
There are Linux utilities that do what you have implemented, but I don't expect them to do any better.
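One such utility is pdftotext, from poppler-utils. Below is a rough sketch of calling it from Python, assuming it is installed and on your PATH. The function names are just for illustration, and as said, it may well miss the same data.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext call."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises CalledProcessError if the tool fails."""
    subprocess.check_call(build_pdftotext_command(pdf_path, txt_path, keep_layout))


# File names taken from the question; nothing is executed here.
command = build_pdftotext_command("Adminform.pdf", "test.txt")
```

Calling `pdf_to_text("Adminform.pdf", "test.txt")` would then write the text file next to the PDF.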
I can suggest Tabula.
It is meant for extracting tables out of PDFs by manually delineating the boundaries of each table, but running it on a PDF with no tables still outputs the text with some formatting retained.
There is some automation available, although it is limited. Refer to:
https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool
Also, though it may not be entirely relevant here, you can use OpenRefine to manage messy data.
Upvotes: 1