Sam S.

Reputation: 803

Extract text and tables from a PDF file in Python

I am looking for a solution to extract both text and tables from a PDF file. While some packages are good at extracting text, they are not good enough at extracting tables.

Upvotes: 9

Views: 44039

Answers (1)

kd4ttc

Reputation: 1185

The answer depends on whether the question is general or specific to a single form. Your approach is reasonable for the general case, but results will vary from document to document. If your PDF is a single form or report that is regenerated with different data each time, consider converting it from PDF to PostScript and then seeing whether you can parse the PostScript.

Two utilities do this: pdf2ps and pdftops. Try each. This approach works better if you know some PostScript. With some luck, the needed fields may be simple text strings. Worth a try.
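
A rough sketch of that idea in Python, in case it helps. The function names `pdf_to_postscript` and `extract_ps_strings` are made up for illustration, and it assumes one of the two converters is on your PATH. The regex only picks up plain PostScript literal strings written as `(...)`; it does not handle unescaped nested parentheses or hex strings, and whether the strings are readable at all depends on how the converter encodes the fonts.

```python
import re
import shutil
import subprocess

# Matches a PostScript literal string such as (Total: 42). Escaped characters
# like \( and \) are allowed; unescaped nested parentheses are not handled.
PS_STRING = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def pdf_to_postscript(pdf_path, ps_path):
    """Convert a PDF to PostScript with whichever converter is installed."""
    for tool in ("pdftops", "pdf2ps"):
        if shutil.which(tool):
            # Both tools accept: <tool> input.pdf output.ps
            subprocess.run([tool, pdf_path, ps_path], check=True)
            return tool
    raise RuntimeError("neither pdftops nor pdf2ps was found on PATH")


def extract_ps_strings(ps_text):
    """Return the literal strings found in PostScript source, unescaped."""
    strings = []
    for match in PS_STRING.finditer(ps_text):
        raw = match.group(1)
        # Undo the simple backslash escapes for parentheses and backslash.
        strings.append(re.sub(r"\\([()\\])", r"\1", raw))
    return strings
```

To try it, call `pdf_to_postscript("report.pdf", "report.ps")` on one of your files (the file names here are placeholders), read the `.ps` file as text, and pass it to `extract_ps_strings`. If the fields you need show up as plain strings in a stable order, you can pick them out by position or by the label that precedes them.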

Upvotes: 1
