Reputation: 15

How to extract a PDF table into a dataframe using tabula

I am trying to extract the "Content" page of a PDF file (e.g. page 2) as a table and build a dataframe that maps each item to its starting page number. Some people suggested using Tabula. I tried a few lines, but I either get an error saying the read_pdf module cannot be found or I get an empty dataframe. I would appreciate any help getting it to work.

from tabula import wrapper

myfile='http://www.hkexnews.hk/listedco/listconews/SEHK/2017/0410/LTN201704101126_C.pdf'

df = wrapper.read_pdf(myfile)

Upvotes: 1

Answers (1)

Pants

Reputation: 116

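from tabula import read_pdf

# Local copy of the PDF from the question
File = "ArchivedResults/LTN201704101126_C.pdf"

# guess=False turns off automatic table detection;
# columns gives the x-coordinates of the column separators
df = read_pdf(File, pages=2, guess=False, columns=(248, 385))
print(df)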



    Unnamed: 0          目錄
0            2        公司資料
1            3        財務概要
2            4        主席報告
3           11    管理層討論及分析
4           27       董事會報告
5           66      企業管治報告
6           86  環境、社會及管治報告
7          100     獨立核數師報告
8          109       綜合收益表
9          110     綜合全面收益表
10         111     綜合財務狀況表
11         114     綜合權益變動表
12         116     綜合現金流量表
13         118    綜合財務報表附註
14         227          釋義

Tabula seems to have trouble finding a table when it has only 2 columns. The solution is to turn off automatic table detection (guess=False) and then specify where the columns should be. Note that you specify only the separators between columns, but you must specify at least 2, so I set the second column separator to an arbitrary distance past your last column. Some users may need to specify an area (area=(top,left,bottom,right)), but for your example that wasn't necessary.
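For anyone who wants to reuse this on other files, here is a rough sketch of how those options could be wrapped up. The helper names build_read_pdf_options and extract_contents_table are made up for illustration and are not part of tabula; only read_pdf and its pages, guess, columns and area keywords come from the library. It also assumes that recent tabula-py releases return a list of dataframes from read_pdf, while older ones return a single dataframe, so it handles both.

```python
def build_read_pdf_options(page, column_separators, area=None):
    """Collect the keyword arguments described above for tabula's read_pdf.

    Hypothetical helper, not part of tabula. column_separators are the
    x-coordinates of the boundaries between columns.
    """
    separators = list(column_separators)
    # The separators are positions along the page width, so they should increase.
    if any(b <= a for a, b in zip(separators, separators[1:])):
        raise ValueError("column separators must be strictly increasing")

    options = {
        "pages": page,
        "guess": False,  # turn off automatic table detection
        "columns": separators,
    }
    if area is not None:
        top, left, bottom, right = area
        options["area"] = [top, left, bottom, right]
    return options


def extract_contents_table(pdf_path, page, column_separators, area=None):
    """Read one page as a table (hypothetical wrapper around read_pdf)."""
    # Imported here so the options helper works without tabula-py or Java.
    from tabula import read_pdf

    result = read_pdf(pdf_path, **build_read_pdf_options(page, column_separators, area))
    # Assumption: newer tabula-py versions return a list of dataframes.
    return result[0] if isinstance(result, list) else result


# Example call, using the values from this answer:
# df = extract_contents_table("ArchivedResults/LTN201704101126_C.pdf", 2, (248, 385))
```

If the table only covers part of the page, passing area=(top, left, bottom, right) to the same wrapper should restrict the extraction to that region; the coordinates have to be measured from your own PDF.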

Upvotes: 2
