Reputation: 2785

Extracting Tables from PDFs Using Tabula

I came across a great library called Tabula, and it almost did the trick. Unfortunately, the first page of my PDF has a large area of useless content that I don't want Tabula to extract. According to the documentation, you can specify the page area to extract from. However, the useless area is only on the first page, so if I apply the area to every page, Tabula will miss the top section of all subsequent pages. Is there a way to make the area condition apply only to the first page of the PDF?

from tabula import read_pdf

df = read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530,12.75,790.5,561), pages='all')

Upvotes: 2

Answers (4)

dataninsight

Reputation: 1343

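pip install tabula-py
pip install tabulate

from tabula import read_pdf
from tabulate import tabulate

# Read tables from page 2 onwards; replace 5 with the last page number of your PDF
df = read_pdf("abc.pdf", pages="2-5")  # path to the PDF file
print(tabulate(df))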

Parameters:

pages (str, int, list of int, optional): An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int. Default: 1

Examples

'1-2,3', 'all', [1,2]

Since the first page is useless, drop it and read from the second page up to the last page.
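As a minimal sketch of the "skip the first page" idea, the page list can be built from the page count and passed as the `pages` argument (a list of int is one of the accepted forms listed above). The helper name `pages_after_first` and the page count `5` are hypothetical; the page count has to come from somewhere else.

```python
def pages_after_first(n_pages):
    """Return the 1-indexed page numbers 2..n_pages as a list of int."""
    if n_pages < 2:
        raise ValueError("the PDF needs at least two pages to skip the first one")
    return list(range(2, n_pages + 1))


# Hypothetical page count; the result could be passed as read_pdf(..., pages=page_list)
page_list = pages_after_first(5)
print(page_list)
```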

Upvotes: 0

mikhael

Reputation: 11

Passing the parameter guess=False to read_pdf will solve the problem.

Upvotes: 1

Parvathirajan Natarajan

Reputation: 1324

Use the code below; it may help you.

from tabula import read_pdf

# multiple_tables=True returns a list with one DataFrame per detected table
tables = read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# Write each table to its own Excel file: output1.xlsx, output2.xlsx, ...
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)

Upvotes: 2

DavidVFF

Reputation: 81

I'm working on something similar (parsing bank statements) and had the same issue. The only solution I have found so far is to parse each page individually.

The only problem is that this requires knowing in advance how many pages the file has. So far I have not found a way to get this directly from Tabula, so I decided to use the pyPdf module to get the number of pages.

import pyPdf
from tabula import read_pdf

# Raw string so the backslashes in the Windows path are kept literally
pdf_path = r"C:\Users\riley\Desktop\Bank Statements\50340.pdf"

# Count the pages with pyPdf
with open(pdf_path, mode='rb') as pdf_file:
    n = pyPdf.PdfFileReader(pdf_file).getNumPages()

df = []
for page in [str(i+1) for i in range(n)]:
    if page == "1":
        # Restrict the extraction area on the first page only
        df.append(read_pdf(pdf_path, area=(530,12.75,790.5,561), pages=page))
    else:
        df.append(read_pdf(pdf_path, pages=page))

Note that there are some known, still-open issues with Tabula both when reading each page individually and when reading all pages at once.
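The per-page logic above can also be separated from the actual extraction, which makes it easier to check. This is only a sketch: `build_page_options` and `read_all_pages` are hypothetical helper names, and the page count is assumed to be known already.

```python
def build_page_options(n_pages, first_page_area):
    """Return one dict of read_pdf keyword arguments per page (1-indexed)."""
    options = []
    for page in range(1, n_pages + 1):
        kwargs = {"pages": page}
        if page == 1:
            # Only the first page gets the restricted area (top, left, bottom, right)
            kwargs["area"] = first_page_area
        options.append(kwargs)
    return options


def read_all_pages(pdf_path, n_pages, first_page_area):
    """Call tabula's read_pdf once per page and collect the results in a list."""
    from tabula import read_pdf  # imported lazily; needs tabula-py and Java

    return [
        read_pdf(pdf_path, **kwargs)
        for kwargs in build_page_options(n_pages, first_page_area)
    ]


# Inspect the arguments that would be used for a 3-page file
for kwargs in build_page_options(3, (530, 12.75, 790.5, 561)):
    print(kwargs)
```

Keeping the argument building in its own function means the first-page special case can be verified without running Tabula at all.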

Good luck!

08/03/2017 EDIT:

I found a simpler way to count the pages of the PDF without going through pyPdf:

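import re

def count_pdf_pages(file_path):
    # Bytes pattern, because the file is read in binary mode.
    # Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
    rxcountpages = re.compile(br"/Type\s*/Page([^s]|$)", re.MULTILINE|re.DOTALL)
    with open(file_path, "rb") as temp_file:
        return len(rxcountpages.findall(temp_file.read()))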

Here, file_path is the path to your file.
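To see what the regular expression above actually counts, here is a small sketch that applies it to an in-memory byte string instead of a file. The sample bytes are made up and are not a valid PDF; `count_pages_in_bytes` is a hypothetical name. A bytes pattern is assumed because the data is binary. Keep in mind this is a heuristic: it relies on the page objects being visible as plain text in the file, which may not hold for PDFs that store them in compressed object streams.

```python
import re

# Same pattern as above, written as a bytes pattern for binary data
PAGE_OBJECT = re.compile(br"/Type\s*/Page([^s]|$)", re.MULTILINE | re.DOTALL)


def count_pages_in_bytes(data):
    """Count '/Type /Page' entries, ignoring '/Type /Pages' tree nodes."""
    return len(PAGE_OBJECT.findall(data))


# Made-up fragment: one page-tree node followed by two page objects
sample = b"<< /Type /Pages /Count 2 >>\n<< /Type /Page >>\n<< /Type/Page\n>>"
print(count_pages_in_bytes(sample))
```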

Upvotes: 6
