Reputation: 347

Python 3 - Data mining from PDF

I'm working on a project that requires obtaining data from some PDF documents.

Currently I'm using the Foxit toolkit (calling it from the script) to convert the document to txt, and then I iterate through it. I'm pretty happy with it, but $100 is just something I can't afford for such a small project.

I've tested all the free converters that I could find (like xpdf and pdftotext), but they just don't cut it: they mess up the format in a way that I can't use the words to locate the data.
I've tried some Python modules like pdfminer, but they don't seem to work well in Python 3.
I can't get the data before it's converted to PDF because I get the documents from a phone carrier.

I'm looking for a way of getting the data from the PDF, or a converter that at least follows the newlines properly.

Update: PyPDF2 is not grabbing any text whatsoever from the PDF document.

Upvotes: 4

Answers (4)

opticaliqlusion

Reputation: 337

I had the same problem when I wanted to do some deep inspection of PDFs for security analysis. I had to write my own utility that parses the low-level objects and literals, unpacks streams, etc., so I could get at the "raw data":

https://github.com/opticaliqlusion/pypdf

It's not a feature-complete solution, but it is meant to be used in a pure Python context where you can define your own visitors to iterate over all the streams, text, id nodes, etc. in the PDF tree:

class StreamIterator(PdfTreeVisitor):
    '''For deflating (not crossing) the streams'''
    def visit_stream(self, node):
        print(node.value)
        pass
...
StreamIterator().visit(tree)

Anyhow, I don't know if this is the kind of thing you were looking for, but I used it to do some security analysis when looking at suspicious email attachments.
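For readers who just want to see what "unpacking streams" means, here is a minimal standard-library sketch. It is not the code from the repository above. It assumes every stream is either uncompressed or compressed with plain FlateDecode (zlib), and it ignores the object dictionaries, /Length entries, and any other filters, so treat it as a rough illustration only. The name iter_streams is made up for this example.

```python
import re
import zlib

# Matches the bytes between a "stream" keyword and the next "endstream".
# Assumption: streams are not nested and the keywords do not occur inside the data.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def iter_streams(pdf_bytes):
    """Yield the content of each stream, inflated when it is zlib data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates stray bytes after the compressed data.
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not zlib data (or a filter this sketch does not handle):
            # hand back the raw bytes unchanged.
            yield raw
```

You would call it as iter_streams(open("some.pdf", "rb").read()) and look through the results for text operators such as Tj. A real parser has to honour the /Filter and /Length entries of each object, which this sketch skips.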

Cheers!

Upvotes: 0

dfranca

Reputation: 5322

PyPDF2 seems to be the best one available for Python 3. It's well documented and the API is simple to use.

It can also work with encrypted files, retrieve metadata, merge documents, etc.

A simple use case for extracting the text:

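from PyPDF2 import PdfFileReader

# Extract the text while the file is still open.
# Newer PyPDF2 releases rename these to PdfReader and extract_text().
with open("test.pdf", "rb") as f:
    ipdf = PdfFileReader(f)
    text = [p.extractText() for p in ipdf.pages]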
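Once you have one string per page (the text list above), locating data by keyword can be sketched roughly like this. The function name find_lines and the sample strings in the usage note are made up for illustration, and the approach only helps if the extracted text keeps its line breaks, which depends on the PDF.

```python
def find_lines(pages, keyword):
    """Return (page_number, line) pairs for lines containing keyword.

    pages is a list with one extracted-text string per page.
    The match is case-insensitive and page numbers start at 1.
    """
    needle = keyword.lower()
    hits = []
    for page_number, page_text in enumerate(pages, start=1):
        for line in page_text.splitlines():
            if needle in line.lower():
                hits.append((page_number, line.strip()))
    return hits
```

For example, find_lines(["Account 123\nTotal due: 45.00"], "total") gives back the second line of page 1, and from there you can split the line to get at the value. If nothing ever matches, print the extracted strings first to check whether any text came out at all.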

Upvotes: 3

taufikedys

Reputation: 339

Here is an example using PyPDF2:

from PyPDF2 import PdfFileReader

# strict=False tolerates minor problems in the PDF instead of raising errors.
with open("FileName", "rb") as pdfFileObj:
    pdfReader = PdfFileReader(pdfFileObj, strict=False)
    data = [page.extractText() for page in pdfReader.pages]

More information on PyPDF2 here.

Upvotes: 1

T0m

Reputation: 153

Sadly, I don't believe there is a good free Python PDF converter. However, pdf2htmlEX, although it is not a Python module, works extremely well and gives you much more structured data (HTML) than a simple text file. From there you can use Python tools such as Beautiful Soup to scrape the HTML file.

link - http://coolwanglu.github.io/pdf2htmlEX/
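As a rough idea of the scraping step, here is a sketch that uses the standard library's html.parser in place of Beautiful Soup so it runs without extra installs. It assumes the PDF has already been converted to HTML, and it assumes each line of text sits in a div whose class list contains "t". That class name is my assumption about the generated markup, so check your own output file and adjust line_class if it differs. The names TextLineCollector and extract_lines are made up for this example.

```python
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains line_class."""

    def __init__(self, line_class="t"):
        super().__init__()
        self.line_class = line_class
        self.lines = []
        self._depth = 0   # > 0 while inside a matching div (counts nested divs)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A div nested inside a matching div: track it so we know when to stop.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.line_class in classes:
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_lines(html, line_class="t"):
    """Return the text lines found in an HTML string, in document order."""
    parser = TextLineCollector(line_class)
    parser.feed(html)
    parser.close()
    return parser.lines
```

You would read the generated HTML file into a string and pass it to extract_lines, then search the returned list for the labels you care about. With Beautiful Soup the same idea is a short find_all call on the div elements.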

Hope this helps.

Upvotes: 1
