Reputation: 2785
I came across a great library called Tabula and it almost did the trick. Unfortunately, there is a lot of useless area on the first page that I don't want Tabula to extract. According to documentation, you can specify the page area you want to extract from. However, the useless area is only on the first page of my PDF file, and thus, for all subsequent pages, Tabula will miss the top section. Is there a way to specify the area condition to only apply to the first page of the PDF?
from tabula import read_pdf
df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages='all')
Upvotes: 2
Views: 37803
Reputation: 1343
pip install tabula-py
pip install tabulate
#reads table from pdf file
df = read_pdf("abc.pdf", pages=[2:]) #address of pdf file
print(tabulate(df))
Parameters:
pages (str, int, list of int, optional)
An optional values specifying pages to extract from. It allows str,int
, list of :int. Default: 1
Examples
'1-2,3', 'all', [1,2]
since the first page is useless dropping first page and reading upto last page
Upvotes: 0
Reputation: 1324
Use the below code ! It may help you !!!
import os
os.path.abspath("E:/Documents/myPy/")
from tabula import wrapper
tables = wrapper.read_pdf("MyPDF.pdf",multiple_tables=True,pages='all')
i=1
for table in tables:
table.to_excel('output'+str(i)+'.xlsx',index=False)
print(i)
i=i+1
Upvotes: 2
Reputation: 81
I'm trying to work on something similar (parsing bank statements) and had the same issue. The only way to solve this I have found so far is to parse each page individually.
The only problem is that this requires to know in advance how many pages your file is composed of. For the moment I have not found a how to do this directly with Tabula, so I've decided to use the pyPdf module to get the number of pages.
import pyPdf
from tabula import read_pdf
reader = pyPdf.PdfFileReader(open("C:\Users\riley\Desktop\Bank Statements\50340.pdf", mode='rb' ))
n = reader.getNumPages()
df = []
for page in [str(i+1) for i in range(n)]:
if page == "1":
df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages=page))
else:
df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", pages=page))
Notice that there are some known and open issues when reading each page individually, or all at the same time.
Good luck!
08/03/2017 EDIT:
Found a simpler way to count the pages of the pdf without going through pyPDf
import re
def count_pdf_pages(file_path):
rxcountpages = re.compile(r"/Type\s*/Page([^s]|$)", re.MULTILINE|re.DOTALL)
with open(file_path, "rb") as temp_file:
return len(rxcountpages.findall(temp_file.read()))
where file_path is the path to your file of course
Upvotes: 6