Reputation: 115
I'm using the tabula library to read a set of PDFs. Each PDF contains a table with headers (columns) and the corresponding data. It all worked perfectly except for the last PDF. Code:
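import tabula
path = "file.pdf"
# read_pdf lives in the tabula module; the pandas_options key must be the string "header"
df = tabula.read_pdf(path, pages="2", multiple_tables=False,
                     output_format="dataframe", pandas_options={"header": None})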
Part of the dataframe output (example):
nan SBI nan nan nan nan nan nan nan nan nan nan
JKL1LU1UKDAO/ /O/NEPLW45WF3CKL AF HSF1P PUAVKM RO0SA OSOAEAUMM5M31/6 PO LLŠF
KLMIMOG 0TLSL P0EK RV V OKŠGVJAVUAMNAWA ACADFUIF S JN FKFKLLLGLDAA2F LEV KA OTIF 2A4 KACNATULO01F2NVSCFRE BB AG05ANJA OLE4CPIVL1SGA 2AFK MR0HASET2PMG MLIONEKO0KF 0IEOJB1 L E NECGCVL1GXLDA 7019N8BVPV90
It is definitely not the code, since I even tried the web-based Tabula (https://tabula.technology/), where you can specify the aspect ratio (as in the code I used), and it only occasionally recognizes a word or character.
It seems to have to do with the way the table was constructed inside the PDF. When I click edit in the PDF, I can see a bunch of text boxes: sometimes chunks of text grouped together, sometimes separate letters, words, etc.
There is also some sort of hidden layer of information on some parts of the pages.
Even after cropping specific parts, deleting metadata and hidden and overlapping objects, and then exporting to PDF again (in Adobe Reader), the problem remains when I load the PDF.
The only way I could get the right text from the PDF was to extract just the text with the following library and code:
import fitz  # PyMuPDF
text = ""
path = "file.pdf"
doc = fitz.open(path)
for page in doc:
    # get_text() is the name in current PyMuPDF releases (older ones spelled it getText())
    text += page.get_text()
This gives me the text exactly as it appears in the PDF, but it is far from a dataframe: it would take a long time to pre-process, clean, and parse it into the right format to get the desired dataframe, which should be possible directly with tabula.
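For reference, a minimal sketch of the kind of parsing step this would involve. It assumes (this is not guaranteed for any given PDF) that each table row ends up on its own line and that cells are separated by a tab or by at least two spaces; the function name `rows_from_text` and the sample text are made up for illustration.

```python
import re


def rows_from_text(text, min_gap=2):
    """Split extracted page text into a list of rows (lists of cell strings).

    Assumption: one table row per line, cells separated by a tab
    or by at least `min_gap` consecutive spaces.
    """
    splitter = re.compile(r"\t+| {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(splitter.split(line))
    return rows


# Hypothetical sample standing in for the text returned by the extractor.
sample = "Name   Amount   Date\nAlice   10   2020-01-01\n\nBob   20   2020-01-02\n"
rows = rows_from_text(sample)
```

If the first row holds the headers, the result could then be handed to pandas, e.g. `pandas.DataFrame(rows[1:], columns=rows[0])`. Real extractor output often needs a different splitting rule, so this only shows the shape of the work.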
I also tried two more libraries, PyPDF2 and pdfminer; both produce plain string output, which would likewise require lengthy preprocessing.
from pdfminer.high_level import extract_text
text = extract_text(path)  # path is the PDF file path, e.g. "file.pdf"
Thus, my question is: how can I read the PDF as it is and get its tables into a dataframe that I can then manipulate? Any suggestions are welcome.
Thanks in advance!
Upvotes: 0
Views: 132
Reputation: 115
The solution for extracting a table from a partially searchable PDF file is to run the OCR feature in Adobe Reader on it first. After that, tabula is able to read the file and extract the table.
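The same idea could probably be scripted rather than done by hand. The sketch below swaps in the open-source OCRmyPDF command-line tool for Adobe's OCR, which is my substitution and not what was actually used here. It assumes `ocrmypdf`, tabula-py, and Java are installed; the helper names `build_ocr_command` and `ocr_then_extract` are hypothetical.

```python
import subprocess


def build_ocr_command(src, dst):
    """Build the OCRmyPDF command line for re-OCRing a PDF.

    --force-ocr rasterizes every page and runs OCR even where a text
    layer already exists, replacing that (possibly garbled) layer.
    """
    return ["ocrmypdf", "--force-ocr", src, dst]


def ocr_then_extract(src, dst, pages="2"):
    """Run OCR on `src`, write the result to `dst`, then read it with tabula."""
    import tabula  # imported lazily: needs tabula-py and a Java runtime

    subprocess.run(build_ocr_command(src, dst), check=True)
    return tabula.read_pdf(dst, pages=pages, multiple_tables=False,
                           pandas_options={"header": None})
```

Calling `ocr_then_extract("file.pdf", "file_ocr.pdf")` would leave the original file untouched and read the table from the OCRed copy. OCR quality varies, so the extracted cells may still need checking.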
Upvotes: 1