Ecko

Reputation: 115

Partially searchable PDF document

I'm using the tabula library to read each PDF. Each PDF contains a table with headers (columns) and the corresponding data. It all worked perfectly except for the last PDF. Code:

import tabula

df = tabula.read_pdf(path, pages="2", multiple_tables=False,
                     output_format="dataframe",
                     pandas_options={"header": None})

Part of the DataFrame output (example):

nan SBI nan nan nan nan nan nan nan nan nan nan
JKL1LU1UKDAO/ /O/NEPLW45WF3CKL  AF HSF1P PUAVKM RO0SA OSOAEAUMM5M31/6 PO LLŠF
KLMIMOG 0TLSL P0EK RV V OKŠGVJAVUAMNAWA ACADFUIF S JN FKFKLLLGLDAA2F LEV KA OTIF 2A4 KACNATULO01F2NVSCFRE  BB AG05ANJA OLE4CPIVL1SGA 2AFK MR0HASET2PMG MLIONEKO0KF 0IEOJB1 L E NECGCVL1GXLDA 7019N8BVPV90

It is definitely not the code: I even tried the web-based Tabula app (https://tabula.technology/), where you can specify the extraction area (as in my code), and it only occasionally recognizes a word or character.

It seems to have to do with how the table was constructed inside the PDF. When I hit edit in the PDF, I can see a bunch of text boxes: sometimes chunks of text grouped together, sometimes separate letters, words, etc.

There is also some sort of hidden layer of information on parts of the pages.

Even after cropping specific parts, deleting metadata, and removing hidden and overlapping objects, then exporting to PDF again (in Adobe), the problem remains when I load the new PDF.

The only way I could get the right text out of the PDF is to scrape just the text, with the following library and code:

import fitz

text = ""
path = "file.pdf"

doc = fitz.open(path)  # PyMuPDF
for page in doc:
    text += page.get_text()  # getText() in older PyMuPDF versions

This gives me the text exactly as it appears in the PDF, but that is far from a DataFrame: it would take quite a while to preprocess, clean, and parse it into the right format to ultimately get the desired DataFrame, which should be possible directly with tabula.
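For what it's worth, if the raw fitz text comes out line-per-row, it can sometimes be coerced into a DataFrame directly with pandas. A minimal sketch, assuming (and this is an assumption about the particular file) that rows are newline-separated and fields are whitespace-separated; the sample text below is made up:

```python
import pandas as pd
from io import StringIO

# Hypothetical raw text, as page.get_text() might return it for a
# simple table -- adjust the split logic to the real file's layout.
raw = """Name Qty Price
Widget 4 9.99
Gadget 2 19.50"""

# Treat the text as whitespace-delimited; the first line becomes the header.
df = pd.read_csv(StringIO(raw), sep=r"\s+")
```

This only works when the text layer is intact and columns never contain embedded spaces, which is exactly what fails in the problematic PDF here, so it is a fallback rather than a fix.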

I tried two more libraries, PyPDF2 and pdfminer; both produce string output, which would likewise require lengthy preprocessing.

from pdfminer.high_level import extract_text

text = extract_text(path)

Thus, my questions are:

  1. What would be the best-practice approach here? Should I try transforming the PDF into a fully searchable one? If so, what would be the most Pythonic way?
  2. Cropping outside of Python seems like a rookie approach: I'm cropping and deleting things just to fix the extraction area, and losing some data in the process. There must be a way to access all this information from Python in order to get a DataFrame.
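On point 2: tabula-py can do the cropping inside Python via its `area` parameter, which takes `[top, left, bottom, right]` coordinates in PDF points, so no external editing is needed. A sketch with placeholder coordinates (the numbers are hypothetical -- measure the real table region in a PDF viewer):

```python
def read_cropped_table(path, area):
    """Read one table from page 2 of *path*, cropped to *area*,
    where area is [top, left, bottom, right] in PDF points."""
    import tabula  # requires tabula-py and a Java runtime
    return tabula.read_pdf(
        path,
        pages="2",
        area=area,
        multiple_tables=False,
        pandas_options={"header": None},
    )

# Hypothetical crop box -- replace with measured coordinates.
crop = [100.0, 30.0, 700.0, 560.0]
```

Note that cropping only helps if the underlying text layer is readable; it will not repair the scrambled characters shown above.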

The main idea is to read the PDF as-is and reproduce its tables in a DataFrame so I can manipulate them. Any suggestions are welcome.

Thanks in advance!

Upvotes: 0

Views: 132

Answers (1)

Ecko

Reputation: 115

The solution for extracting tables from a partially searchable PDF is to run OCR on it first, e.g. with the Recognize Text (OCR) feature in Adobe Acrobat. After that, tabula is able to read and extract the table.
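The same OCR-then-extract step can be done entirely in Python, which avoids the manual round trip through Acrobat. A sketch using the ocrmypdf library (an assumption on my part, not what the answer used; it needs Tesseract installed, and tabula-py needs Java):

```python
def ocr_then_extract(src, dst):
    """Add an OCR text layer to *src*, write it to *dst*,
    then let tabula read the table from the OCR'd copy."""
    import ocrmypdf  # pip install ocrmypdf (requires Tesseract)
    import tabula    # pip install tabula-py (requires Java)

    # skip_text=True leaves pages that already have a text layer
    # untouched and OCRs only the image-only pages.
    ocrmypdf.ocr(src, dst, skip_text=True)
    return tabula.read_pdf(dst, pages="2", multiple_tables=False)
```

OCR quality depends on scan resolution, so the extracted table may still need light cleanup afterwards.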

Upvotes: 1

Related Questions