Gustav Rasmussen

Reputation: 3961

Tabula-py skips first page from PDF and misses some tabular data

I am using Python (3.8.1) and tabula-py (2.1.0) (https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) to extract tables from a text-based PDF file (a monthly AWS billing report).

A sample of the PDF file is shown below (bottom of the 1st page and top of the 2nd page).

[Image: PDF sample]


The Python script is shown below:

from tabula import read_pdf
from tabulate import tabulate

df = read_pdf(
   "my_report.pdf",
   output_format="dataframe",
   multiple_tables=True,
   pages="all",
   silent=True,
   # TODO: area = (top, left, bottom, right) # ?
)

print(tabulate(df))


Which generates the following output:

---  ---------------------------------------------------------------------------  ---------------------  ---------
  0  region                                                                       nan                    nan
  1  AWS CloudTrail APS2-PaidEventsRecorded                                       nan                    $3.70
  2  0.00002 per paid event recorded in Asia Pacific (Sydney)                     184,961.000 Events     $3.70
  3  region                                                                       nan                    nan
  4  Asia Pacific (Tokyo)                                                         nan                    $3.20

I suspect the area option has to be set properly, since the topmost and leftmost data are sometimes omitted. Is this the case, and if so, how do you find the correct area covering all tabular data in the PDF file?

Thanks in advance.

Upvotes: 0

Views: 3328

Answers (2)

John Smith

Reputation: 51

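Try passing guess=False to read_pdf. By default (guess=True), tabula guesses which portion of each page to analyze, and that guess may leave out rows or columns near the page edges.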
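As a rough sketch of that suggestion, assuming the same my_report.pdf as in the question: the helper names build_read_options and read_all_tables are made up for illustration, while guess, pages, multiple_tables, silent and area are existing read_pdf options. I have not run this against the billing report itself.

```python
def build_read_options(area=None):
    """Keyword arguments for tabula.read_pdf with area guessing turned off.

    Hypothetical helper for illustration; not part of tabula-py.
    """
    options = {
        "pages": "all",
        "multiple_tables": True,
        "guess": False,  # do not let tabula guess the table region per page
        "silent": True,
    }
    if area is not None:
        # tabula expects the order (top, left, bottom, right), in points.
        options["area"] = list(area)
    return options


def read_all_tables(pdf_path, area=None):
    """Read every table from pdf_path without automatic area guessing."""
    # Imported here so the options helper can be used without tabula installed.
    from tabula import read_pdf

    return read_pdf(pdf_path, **build_read_options(area))
```

Calling read_all_tables("my_report.pdf") should then analyze each page in full instead of only a guessed region. Whether that recovers the missing rows depends on the PDF layout.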

Upvotes: 4

Gustav Rasmussen

Reputation: 3961

I managed to solve this issue by enlarging the area that tabula searches for table data:

from tabula import read_pdf

# Get the detected table location from page 2 (JSON output includes coordinates):
tables = read_pdf("my_report.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
# Expand the borders slightly; area is ordered (top, left, bottom, right):
test_area = [top - 20, left - 20, bottom + 10, right + 10]

# Now read_pdf gives all data with the following call:

df = read_pdf(
   "my_report.pdf",
   multiple_tables=True,
   pages="all",
   silent=True,
   area=test_area,
)
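If a page yields more than one detected table, the same idea can be wrapped in a small helper that takes the union of all detected boxes before padding it. This is only a sketch: expanded_area is a hypothetical name, the margins are the ones I happened to use above, and it assumes each entry of the JSON output carries "top", "left", "width" and "height" keys as in my snippet.

```python
def expanded_area(tables, pad_top=20, pad_left=20, pad_bottom=10, pad_right=10):
    """Return [top, left, bottom, right] covering all tables, padded outwards.

    tables: list of dicts with "top", "left", "width" and "height" keys,
    as found in the output of read_pdf(..., output_format="json").
    """
    if not tables:
        raise ValueError("no tables detected, cannot derive an area")

    # Bounding box around every detected table.
    top = min(t["top"] for t in tables)
    left = min(t["left"] for t in tables)
    bottom = max(t["top"] + t["height"] for t in tables)
    right = max(t["left"] + t["width"] for t in tables)

    # Pad outwards, but never go above or left of the page origin.
    return [
        max(top - pad_top, 0),
        max(left - pad_left, 0),
        bottom + pad_bottom,
        right + pad_right,
    ]
```

The result can be passed as area=expanded_area(tables) to read_pdf. The padding values are guesses that worked for my report and will likely need tuning for other documents.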

Upvotes: 0
