Reputation: 11
I am trying to extract a table from a PDF document using tabula. I use this code:
df = tabula.read_pdf(io.BytesIO(content), pages=12, pandas_options={'header': None}, multiple_tables=True, columns=(78.39, 226.97, 280.97, 370.04, 461.02, 550.06))
However, after extraction, the first two columns come out merged into a single column. I tried changing the column coordinates, but it didn't help.
I also tried guess=False, and that didn't help either.
Could anyone help me with this issue? Many thanks.
Upvotes: 1
Views: 2189
Reputation: 1129
Open the PDF in the SumatraPDF reader. Press ‘m’ to show the measurement display at the top left. Then position the cursor over the top-left and bottom-right corners of the table to read off their coordinates, and pass them to tabula:
java -jar tabula-1.0.2-jar-with-dependencies.jar -p 2 -a 164,20,390,771 "myPdf.pdf" -o outfile.csv
Note: a) option ‘-p’ gives the page number
b) option ‘-a’ gives the area of the table (top,left,bottom,right), using the coordinates read from SumatraPDF.
c) "myPdf.pdf" is the PDF to extract from
d) option ‘-o’ gives the filename to save to. Delete this file, if it already exists, before running the tabula command.
This creates the CSV file.
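If you would rather stay in tabula-py, as in the question, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions: the helper names `corners_to_area` and `extract_table` are made up for illustration, the two cursor readings are taken as (x, y) pairs measured in points from the top-left corner of the page, and `guess=False` with `stream=True` is just one combination worth trying together with an explicit area, not a guaranteed fix for the merged columns.

```python
def corners_to_area(top_left, bottom_right):
    """Reorder two (x, y) cursor readings into tabula's [top, left, bottom, right]."""
    x0, y0 = top_left
    x1, y1 = bottom_right
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bottom-right corner must be below and to the right of top-left")
    return [y0, x0, y1, x1]


def extract_table(pdf_path, page, area, columns=None):
    """Hypothetical wrapper: read one page region with tabula-py (requires Java)."""
    # Imported lazily so corners_to_area works without tabula-py installed.
    import tabula

    return tabula.read_pdf(
        pdf_path,
        pages=page,
        area=area,            # [top, left, bottom, right] in points
        columns=columns,      # optional x coordinates of column boundaries
        guess=False,          # do not let tabula guess the table region
        stream=True,          # whitespace-based detection instead of ruling lines
        pandas_options={"header": None},
    )


# Example with the coordinates from the command above:
# area = corners_to_area((20, 164), (771, 390))   # gives [164, 20, 390, 771]
# tables = extract_table("myPdf.pdf", 2, area)
```

With an explicit area in place, the column boundaries from the question can be passed through the `columns` argument to see whether the first two columns then separate.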
Upvotes: 2