Reputation: 7490
I am following this blog to extract text from an invoice PDF file. I need to extract specific fields from the invoice, not just the raw text.
I have tried pdfminer and textract, but both return the text jumbled, which makes it difficult to pull out the fields afterwards.
I came across the Poppler package, which can be downloaded here:
https://poppler.freedesktop.org/releases.html
It looks like it's a .tar file and not a Python package.
I am not sure how to unpack this .tar file and use the package from Python.
Any suggestions on how to install this on my Mac and then use it programmatically in Python to run a batch of PDF files through it and extract data?
Upvotes: 3
Views: 9727
Reputation: 381
Steps to install Poppler on Ubuntu:
sudo apt-get install libpoppler-cpp-dev
pip install --use-pep517 .
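The apt package above provides the Poppler C++ development files, which Python bindings need at build time. If you want to call the command-line tools instead, here is a minimal sketch for checking from Python that they are reachable. It assumes the tool is named pdftotext; on Ubuntu that binary comes from the separate poppler-utils package, and on macOS from the Homebrew formula (brew install poppler). The helper name find_tool is made up for this example.

```python
import shutil


def find_tool(name="pdftotext"):
    """Return the full path of a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("pdftotext not found; install poppler-utils (Ubuntu) or poppler (Homebrew)")
    else:
        print("pdftotext found at", path)
```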
Upvotes: 0
Reputation: 43505
Use subprocess to call the pdftotext program from the Xpdf tools. You can find MS Windows versions of those tools at https://www.xpdfreader.com/download.html. Get the "Xpdf command line tools".
I use it like this (python 3.7):
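import subprocess as sp


def pdftotext(path):
    """
    Generate a text rendering of a PDF file and return it as a single string.
    """
    # '-layout' keeps the physical layout of the page; '-' writes to stdout.
    args = ['pdftotext', '-layout', path, '-']
    # check=True raises CalledProcessError if pdftotext fails;
    # text=True (Python 3.7+) decodes the output to str.
    cp = sp.run(
        args, stdout=sp.PIPE, stderr=sp.DEVNULL,
        check=True, text=True
    )
    # Call .splitlines() on the result if you want a list of lines.
    return cp.stdout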
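To run a whole folder of invoices through this and pick out specific fields, something along these lines might work. It is only a sketch: the field labels ("Invoice No", "Total") and the regular expressions are assumptions about what your invoices look like, and the names FIELD_PATTERNS, extract_fields and process_folder are made up for this example. It also assumes pdftotext is on your PATH.

```python
import re
import subprocess as sp
from pathlib import Path

# Hypothetical patterns; adjust them to the labels your invoices actually use.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(\S+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"^\s*Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE | re.MULTILINE
    ),
}


def pdf_to_text(path):
    """Return the layout-preserving text of one PDF as a single string."""
    cp = sp.run(
        ["pdftotext", "-layout", str(path), "-"],
        stdout=sp.PIPE, stderr=sp.DEVNULL, check=True, text=True,
    )
    return cp.stdout


def extract_fields(text):
    """Return a dict of field name -> first match, or None when not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def process_folder(folder):
    """Extract fields from every *.pdf directly inside folder."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        results[pdf.name] = extract_fields(pdf_to_text(pdf))
    return results
```

Calling process_folder("invoices") would then give one dict per file. Because -layout keeps each label and its value on the same line, line-anchored patterns like the "total" one tend to be easier to write than with jumbled output.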
Upvotes: 3
Reputation: 333
You can try the Poppler bindings for Python here: https://pypi.org/project/python-poppler-qt5/
Upvotes: 0