Baktaawar

Reputation: 7490

Installing Poppler for PDF text extraction

I am trying to follow this blog post to extract text from an invoice PDF file. I need to extract specific fields from the invoice.

https://kaijento.github.io/2017/03/27/pdf-scraping-gwinnetttaxcommissioner.publicaccessnow.com/#pdftotext

I have tried pdfminer and textract, but both return the text jumbled, which makes it difficult to pull out the fields afterwards.

I came across the Poppler package, which can be downloaded here:

https://poppler.freedesktop.org/releases.html

It looks like it is a .tar file, not a Python package.

I am not sure how to unpack this .tar file and use the package from Python.

Any suggestions on how to install this on my Mac and then use it programmatically from Python to run a batch of PDF files through it and extract data?

Upvotes: 3

Views: 9727

Answers (3)

Akoffice

Reputation: 381

Steps to install Poppler on Ubuntu:

sudo apt-get install libpoppler-cpp-dev

pip install --use-pep517 .

Upvotes: 0

Roland Smith

Reputation: 43505

Use subprocess to call the pdftotext program from the xpdf tools. You can find ms-windows versions of those tools at https://www.xpdfreader.com/download.html. Get the "Xpdf command line tools".
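Before calling the tool from Python, it can help to confirm that the executable is actually on the PATH. A minimal sketch, assuming the binary is named `pdftotext` (the helper name `find_pdftotext` is made up for illustration):

```python
import shutil


def find_pdftotext(name="pdftotext"):
    """Return the full path of the executable, or None if it is not on PATH."""
    # shutil.which searches the directories in PATH, much like the shell does.
    return shutil.which(name)
```

If this returns None, the subprocess call will fail with FileNotFoundError, so the tools probably still need to be installed or added to the PATH.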

I use it like this (Python 3.7):

import subprocess as sp

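def pdftotext(path):
    """
    Generate a text rendering of a PDF file as a single string
    (call .splitlines() on the result for a list of lines).
    """
    # '-layout' keeps the physical layout of the page; '-' writes to stdout.
    args = ['pdftotext', '-layout', path, '-']
    # check=True raises CalledProcessError if pdftotext fails;
    # text=True (Python 3.7+) decodes stdout to str.
    cp = sp.run(
      args, stdout=sp.PIPE, stderr=sp.DEVNULL,
      check=True, text=True
    )
    return cp.stdout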
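
To run a whole folder of invoices through it and pull out specific fields, something along these lines might work. This is only a sketch: the field names and regular expressions are made-up examples and would have to be adapted to what the `-layout` output of your invoices actually looks like, and `extract_fields` and `extract_from_folder` are hypothetical helper names.

```python
import re
import subprocess as sp
from pathlib import Path

# Hypothetical patterns for illustration only; real invoices need their own.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number)\s*:?\s*(\S+)", re.I),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return {field: first captured match, or None} for each pattern."""
    found = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


def pdf_to_text(path):
    """Run 'pdftotext -layout' on one file and return its output as a string."""
    cp = sp.run(
        ["pdftotext", "-layout", str(path), "-"],
        stdout=sp.PIPE, stderr=sp.DEVNULL, check=True, text=True,
    )
    return cp.stdout


def extract_from_folder(folder):
    """Map each PDF file name in 'folder' to its extracted fields."""
    return {
        pdf.name: extract_fields(pdf_to_text(pdf))
        for pdf in sorted(Path(folder).glob("*.pdf"))
    }
```

Calling `extract_from_folder("invoices")` would then give one dictionary of fields per PDF in that (assumed) folder. Keeping the regex step separate from the subprocess call means the patterns can be tried on saved text output without re-running pdftotext each time.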

Upvotes: 3

Vidyadhar Rao

Reputation: 333

You can try the Poppler bindings for Python here: https://pypi.org/project/python-poppler-qt5/

Upvotes: 0
