Reputation: 803
I am looking for a way to extract both text and tables from a PDF file. Some packages are good at extracting text, but they are not good enough at extracting tables.
One option is the Azure Form Recognizer Layout Model, but it fails on pages that mix text and tables, in particular when a table is formatted much like running text; in that case the table contents and the surrounding text get merged (see the Azure Form Recognizer sample code at https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/python/FormRecognizer/rest/python-train-extract.md).
I also tried PyPDF2 and pdfplumber; here is the code for PyPDF2:
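import os

import PyPDF2

# PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x/2.x names;
# PyPDF2 3.x renamed them to PdfReader, len(reader.pages), reader.pages[i] and extract_text.
data_path = "directory/to/pdf/files"
texts = []
for fp in os.listdir(data_path):
    # Open each PDF in binary mode; the with-block closes it afterwards.
    with open(os.path.join(data_path, fp), 'rb') as pdfFileObj:
        print(pdfFileObj)
        pdfreader = PyPDF2.PdfFileReader(pdfFileObj)
        count = pdfreader.numPages
        # Concatenate the text of every page into one string per file.
        text = " "
        for i in range(count):
            page = pdfreader.getPage(i)
            text += page.extractText()
    texts.append(text)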
First, PyPDF2 works reasonably well for some PDF files, but for others it does not preserve the spaces between words, for example the PDF from https://www.researchgate.net/publication/342920307_Using_Topic_Modeling_Methods_for_Short-Text_Data_A_Comparative_Analysis.
Second, how can I extract tables if they exist on a page? pdfplumber can extract both text and tables using its extract_text() and extract_table() methods, but it fails to preserve spaces between words for some documents. In my experience it also fails on double-column PDF files.
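For reference, the pdfplumber attempt described above can be sketched roughly as follows. This is a minimal sketch, not a fix for the spacing or double-column problems: `extract_pages` and `clean_table` are hypothetical helper names introduced here, and `extract_tables()` (plural) is used so that pages with more than one table are covered. It assumes pdfplumber is installed.

```python
from typing import List, Optional

# pdfplumber returns a table as a list of rows; empty cells come back as None.
Table = List[List[Optional[str]]]


def clean_table(table: Table) -> List[List[str]]:
    """Hypothetical helper: turn None cells into "" and collapse whitespace in cells."""
    return [[" ".join((cell or "").split()) for cell in row] for row in table]


def extract_pages(pdf_path: str) -> List[dict]:
    """Hypothetical helper: return one dict per page with its text and its tables."""
    import pdfplumber  # third-party; imported here so clean_table works without it

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page": number,
                # extract_text() can return None for a page with no text layer.
                "text": page.extract_text() or "",
                "tables": [clean_table(t) for t in page.extract_tables()],
            })
    return pages
```

Note that the `text` entry generally still contains the words that sit inside the tables, so this alone does not separate the two kinds of content.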
Tabula is another alternative, but judging from its website (https://github.com/tabulapdf/tabula) it is only good with tables. My question, in the end, is: what are the best practices for extracting both kinds of content, text and tables, from PDF files with single-column or double-column pages?
Upvotes: 9
Views: 44039
Reputation: 1185
The answer depends on whether the question is general or specific to a single form. Your approach is reasonable for the general case, but there will be variability. If your PDFs are all instances of a single form or report, generated with different data each time, consider converting the PDF to PostScript and then seeing whether you can parse the PostScript.
Two utilities do this: pdf2ps and pdftops. Try each. This approach is easier if you know some PostScript. With some luck, the needed fields may be simple text strings. Worth a try.
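A rough sketch of that idea in Python, assuming pdftops or pdf2ps is installed and on PATH (both are normally invoked as `tool input.pdf output.ps`). `pdf_to_ps` and `ps_strings` are hypothetical helper names, and the regular expression is only a heuristic: it picks up flat parenthesised string literals and ignores nested parentheses and hex strings.

```python
import re
import shutil
import subprocess
from pathlib import Path
from typing import List

# Heuristic: a parenthesised PostScript string literal, allowing backslash escapes.
# Nested parentheses and <hex> strings are deliberately not handled.
PS_STRING = re.compile(r"\(((?:[^()\\]|\\.)*)\)")


def pdf_to_ps(pdf_path: str, tool: str = "pdftops") -> Path:
    """Hypothetical helper: convert a PDF with pdftops or pdf2ps, return the .ps path."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not on PATH")
    ps_path = Path(pdf_path).with_suffix(".ps")
    subprocess.run([tool, pdf_path, str(ps_path)], check=True)
    return ps_path


def ps_strings(ps_source: str) -> List[str]:
    """Hypothetical helper: list the raw string literals found in PostScript source."""
    return [match.group(1) for match in PS_STRING.finditer(ps_source)]
```

Usage would be something like `ps_strings(pdf_to_ps("report.pdf").read_text(errors="replace"))`, where `report.pdf` is a placeholder name. Whether the strings come out readable depends on how the PDF encodes its fonts, so this only pays off for some files.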
Upvotes: 1