Reputation: 3961
I am using Python (3.8.1) and tabula-py (2.1.0, see https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) to extract tables from a text-based PDF file (a monthly AWS billing report).
A sample of the PDF file is shown below (bottom of the 1st page and top of the 2nd page).
The Python script is shown below:
from tabula import read_pdf
from tabulate import tabulate
df = read_pdf(
    "my_report.pdf",
    output_format="dataframe",
    multiple_tables=True,
    pages="all",
    silent=True,
    # TODO: area = (x_left, x_right, y_left, y_right) # ?
)
print(tabulate(df))
This generates the following output:
--- --------------------------------------------------------------------------- --------------------- ---------
0 region nan nan
1 AWS CloudTrail APS2-PaidEventsRecorded nan $3.70
2 0.00002 per paid event recorded in Asia Pacific (Sydney) 184,961.000 Events $3.70
3 region nan nan
4 Asia Pacific (Tokyo) nan $3.20
My thought is that the area option has to be properly set, since the top- and the left-most data is sometimes omitted. Is this the case, and if so, how do you find the correct area of all tabular data within the PDF file?
Thanks in advance.
Upvotes: 0
Views: 3328
Reputation: 3961
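I managed to solve this issue by enlarging the area that tabula searches for table data. I first let tabula report the location of the table it detects on page 2, then pad that bounding box and pass it back as the area option: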
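from tabula import read_pdf

# Get the location of the first table detected on page 2.
# With output_format="json", each table entry reports its bounding box.
tables = read_pdf("my_report.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
# Expand the borders slightly; area is ordered [top, left, bottom, right]:
test_area = [top - 20, left - 20, bottom + 10, right + 10]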
# Now read_pdf gives all data with the following call:
df = read_pdf(
    "my_report.pdf",
    multiple_tables=True,
    pages="all",
    silent=True,
    area=test_area,
)
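The padding step above can also be wrapped in a small helper. This is only a sketch under a few assumptions: the function names `padded_area` and `union_area` and the `sample` entries are hypothetical, made up for illustration, and the dictionaries only mimic the `top`, `left`, `width` and `height` keys used above. It clamps the padded box at 0 so it cannot run off the top or left edge of the page, and it can merge the boxes of several detected tables instead of relying on `tables[0]` alone:

```python
def padded_area(table, pad_before=20, pad_after=10):
    """Return [top, left, bottom, right] for one table entry, grown by some padding."""
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    # Clamp at 0 so the padding never moves the box past the top/left page edge.
    return [
        max(top - pad_before, 0),
        max(left - pad_before, 0),
        bottom + pad_after,
        right + pad_after,
    ]


def union_area(tables, **padding):
    """Return the smallest box that covers the padded areas of all given tables."""
    areas = [padded_area(t, **padding) for t in tables]
    return [
        min(a[0] for a in areas),  # top-most edge
        min(a[1] for a in areas),  # left-most edge
        max(a[2] for a in areas),  # bottom-most edge
        max(a[3] for a in areas),  # right-most edge
    ]


# Hypothetical entries standing in for read_pdf(..., output_format="json") results.
sample = [
    {"top": 50.0, "left": 30.0, "width": 500.0, "height": 200.0},
    {"top": 10.0, "left": 40.0, "width": 480.0, "height": 100.0},
]
print(padded_area(sample[0]))
print(union_area(sample))
```

The resulting list can be passed as `area=` in the same way as `test_area` above. Note that a single area combined with `pages="all"` is applied to every page, so this only works well when the tables sit in roughly the same region on each page, as they do in my report.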
Upvotes: 0