Reputation: 15
I am trying to extract the "Contents" page of a PDF file (e.g. page 2) as a table and build a dataframe that maps each item to its starting page number. Some people suggested using Tabula. I tried a few lines, but I either get an error saying the read_pdf module cannot be found or I get an empty dataframe. I would appreciate any help getting it to work.
from tabula import wrapper
myfile='http://www.hkexnews.hk/listedco/listconews/SEHK/2017/0410/LTN201704101126_C.pdf'
df = wrapper.read_pdf(myfile)
Upvotes: 1
Views: 3344
Reputation: 116
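from tabula import read_pdf
# Path to a local copy of the PDF
File = "ArchivedResults/LTN201704101126_C.pdf"
# guess=False turns off automatic table detection;
# columns gives the x-positions of the column separators
df = read_pdf(File, pages=2, guess=False, columns=(248, 385))
print(df)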
Unnamed: 0 目錄
0 2 公司資料
1 3 財務概要
2 4 主席報告
3 11 管理層討論及分析
4 27 董事會報告
5 66 企業管治報告
6 86 環境、社會及管治報告
7 100 獨立核數師報告
8 109 綜合收益表
9 110 綜合全面收益表
10 111 綜合財務狀況表
11 114 綜合權益變動表
12 116 綜合現金流量表
13 118 綜合財務報表附註
14 227 釋義
Tabula seems to have trouble finding a table when it has only 2 columns. The solution is to turn off automatic table detection (guess=False) and then specify where the columns should be with the columns argument. Note that you specify only the positions of the separators between columns, and you must give at least 2, so I set the second separator to an arbitrary distance past your last column. Some users may also need to specify an area (area=(top, left, bottom, right)), but for your example that wasn't necessary.
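If you need the same settings for several files, or want to add an area only sometimes, something like the following might help. It is only a rough sketch: build_read_options and read_contents_table are hypothetical helper names, not part of tabula, and the code assumes tabula-py and Java are installed. It also assumes that read_pdf may return either a single DataFrame or a list of DataFrames depending on the tabula-py version, so it handles both.

```python
from typing import Optional, Sequence


def build_read_options(
    page: int,
    column_separators: Sequence[float],
    area: Optional[Sequence[float]] = None,
) -> dict:
    """Assemble keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {
        "pages": page,
        "guess": False,  # turn off automatic table detection
        "columns": list(column_separators),  # positions of the column separators
    }
    if area is not None:
        if len(area) != 4:
            raise ValueError("area must be (top, left, bottom, right)")
        options["area"] = list(area)  # restrict extraction to this region
    return options


def read_contents_table(pdf_path, page, column_separators, area=None):
    """Read one page with explicit column separators (hypothetical helper)."""
    # Imported lazily so build_read_options can be used without tabula installed.
    from tabula import read_pdf

    result = read_pdf(pdf_path, **build_read_options(page, column_separators, area))
    # Assumption: some tabula-py versions return a list of DataFrames,
    # others a single DataFrame, so accept either shape.
    if isinstance(result, list):
        return result[0] if result else None
    return result
```

For the file above, the call would be read_contents_table("ArchivedResults/LTN201704101126_C.pdf", 2, (248, 385)). Pass an area tuple only if the page has other text that gets pulled into the table; the right values depend on your PDF.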
Upvotes: 2