Vasilieva Polina

Reputation: 11

Tabula-py doesn't recognise columns correctly

I am trying to extract a table from a PDF document using tabula. I use this code:

df = tabula.read_pdf(io.BytesIO(content), pages=12, pandas_options={'header': None}, multiple_tables=True, columns=(78.39, 226.97, 280.97, 370.04, 461.02, 550.06))

However, in the output the first two columns are merged into a single column. I tried changing the column coordinates, but it didn't help.

I also tried guess=False, and that didn't help either.

Could anyone help me with this issue? Many thanks.

Upvotes: 1

Views: 2189

Answers (1)

Deepak Garud

Reputation: 1129

Open the PDF in SumatraPDF. Press ‘m’ to show the measurement display in the top-left corner of the window. Then place the cursor over the top-left corner of the table, and then over the bottom-right corner, and note the coordinates shown for each.

Then run the command:

java -jar tabula-1.0.2-jar-with-dependencies.jar -p 2 -a 164,20,390,771 "myPdf.pdf" -o outfile.csv

Notes:

a) Option ‘p’ gives the page number.

b) Option ‘a’ gives the area of the table (top,left,bottom,right), using the coordinates read from SumatraPDF.

c) "myPdf.pdf" is the PDF to extract from.

d) Option ‘o’ gives the filename to save to. If this file already exists, delete it before running the tabula command.
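
The two corner readings map onto the area argument in the order top, left, bottom, right. Here is a minimal Python sketch of that mapping, assuming each reading is an (x, y) pair in points measured from the top-left of the page. The names area_from_corners and tabula_command are hypothetical helpers, not part of tabula, and the corner values are chosen only to reproduce the area in the command above.

```python
import shlex


def area_from_corners(top_left, bottom_right):
    """Turn two (x, y) corner readings into tabula's (top, left, bottom, right)."""
    left, top = top_left
    right, bottom = bottom_right
    if not (left < right and top < bottom):
        raise ValueError("top-left corner must be above and left of bottom-right")
    return (top, left, bottom, right)


def tabula_command(pdf, page, area, outfile,
                   jar="tabula-1.0.2-jar-with-dependencies.jar"):
    """Build the argument list for the tabula-java command line (does not run it)."""
    area_arg = ",".join(f"{value:g}" for value in area)
    return ["java", "-jar", jar, "-p", str(page), "-a", area_arg, pdf, "-o", outfile]


# Illustrative corner readings (x, y), not taken from a real measurement.
area = area_from_corners((20, 164), (771, 390))
cmd = tabula_command("myPdf.pdf", 2, area, "outfile.csv")
print(shlex.join(cmd))
```

If SumatraPDF is showing millimetres or inches instead of points, convert the readings first (1 inch = 72 points, 1 mm = 72/25.4 points).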

This creates the CSV file.
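
Since the question uses tabula-py rather than the jar, the same area can probably be passed to tabula.read_pdf through its area argument, together with column boundaries. The sketch below has not been tested against the asker's PDF. read_kwargs and extract are hypothetical helper names, the column values are illustrative, and combining stream=True with guess=False is an assumption about what suits a table without ruling lines.

```python
def read_kwargs(page, area, columns):
    """Collect keyword arguments for tabula.read_pdf, checking columns against the area."""
    top, left, bottom, right = area
    outside = [x for x in columns if not left < x < right]
    if outside:
        raise ValueError(f"column boundaries outside the area: {outside}")
    return {
        "pages": page,
        "area": list(area),          # top, left, bottom, right, in points
        "columns": sorted(columns),  # x coordinates of column boundaries
        "guess": False,              # use the given area, not auto-detection
        "stream": True,              # assumption: the table has no ruling lines
        "pandas_options": {"header": None},
    }


def extract(pdf_path, **kwargs):
    """Call tabula-py; imported lazily so read_kwargs works without Java installed."""
    import tabula
    return tabula.read_pdf(pdf_path, **kwargs)


kwargs = read_kwargs(page=2, area=(164, 20, 390, 771),
                     columns=[226.97, 78.39, 280.97])
# tables = extract("myPdf.pdf", **kwargs)  # needs tabula-py, Java, and the PDF
```

The check on column positions is only a sanity test: a boundary that falls outside the selected area cannot split anything inside it.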

Upvotes: 2
