Reputation: 131

Tabula-py omitting pages from a PDF document I am trying to extract

I am trying to extract tables from a multi-page PDF with tabula-py, and while the tables on some of the pages of the PDF are extracted perfectly, some pages are omitted entirely.

The omissions seem random and don't correspond to any visible feature of the PDF (every page looks the same): tabula omitted page 1, extracted page 2, omitted pages 3 and 4, extracted page 5, omitted page 6, extracted pages 8 and 9, omitted page 10, extracted page 11, and so on. I am on macOS Sierra 10.12.6 with Python 3.6.3 :: Anaconda custom (64-bit).

I've tried splitting the PDF into shorter sections, even into single pages, but the omitted pages cannot be extracted no matter what I try. I've read the related documentation and the issues filed on the Tabula-py GitHub page as well as here on Stack Overflow, but I haven't found a solution.

The code I use in IPython notebooks is as follows:

To install tabula through the terminal:

pip install tabula-py

To extract the tables in my PDF:

from tabula import read_pdf
df = read_pdf("document_name.pdf", pages="all")

I also tried the following, which didn't make any difference:

df = read_pdf("document_name.pdf", pages="1-361")

To save the data frame into csv:

df.to_csv('document_name.csv')

I'd be really thankful if you could help me with this, as I feel like I'm stuck with a PDF from which I've only managed to extract around 50% of data. This is infuriating, as the 50% looks absolutely perfect, but the other 50% seems out of my reach and renders the larger project of analyzing the data impossible.

I also wonder whether this might be an issue with the PDF rather than Tabula: could the file be mistakenly set as protected or locked? If so, does anyone know how I could check for that and unlock it?

Thanks a ton in advance!

Upvotes: 3

Answers (2)

Gustav Rasmussen

Reputation: 3961

This could be because the area of your data in the PDF file exceeds the area that is being read by tabula. Try the following:

First get the location of your data, by parsing one of the pages into JSON format (here I chose page 2), then extract and print the locations:

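from tabula import read_pdf

# Parse one page that extracts correctly (page 2 here) into JSON to get its bounding box.
tables = read_pdf("document_name.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
# The f"{top=}" shorthand needs Python 3.8+; this form also runs on Python 3.6.
print(f"top={top}\nbottom={bottom}\nleft={left}\nright={right}")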

You can now try to expand these locations slightly by experimentation, until you receive more data from the PDF document:

# area = [top, left, bottom, right]
# Example from page 2 json output: area = [30.0, 59.0, 761.0, 491.0]
# You could then nudge these locations slightly to include a wider data area:
test_area = [10.0, 30.0, 770.0, 500.0]

df = read_pdf(
    "document_name.pdf",
    multiple_tables=True,
    pages="all",
    area=test_area,
    silent=True,  # Suppress all stderr output
)

With multiple_tables=True, read_pdf returns a list of DataFrames, so the df variable will now hold one DataFrame per table found in the PDF.
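The manual nudging above could also be wrapped in a small helper. This is only a sketch: expand_area and the sample bounding box are hypothetical names and values chosen for illustration, and it assumes the JSON entry carries the "top", "left", "width" and "height" keys used earlier.

```python
def expand_area(table, margin=10.0):
    """Return [top, left, bottom, right] grown by `margin` points on each side.

    `table` is assumed to be one entry of the list returned by
    read_pdf(..., output_format="json").
    """
    top = max(table["top"] - margin, 0.0)    # clamp at the top page edge
    left = max(table["left"] - margin, 0.0)  # clamp at the left page edge
    bottom = table["top"] + table["height"] + margin
    right = table["left"] + table["width"] + margin
    return [top, left, bottom, right]


# Hypothetical bounding box matching the page 2 example area above.
page2_table = {"top": 30.0, "left": 59.0, "height": 731.0, "width": 432.0}
test_area = expand_area(page2_table, margin=10.0)
# test_area == [20.0, 49.0, 771.0, 501.0]
```

The resulting list can be passed as area=test_area to read_pdf. Trying a few margin values is still a matter of experimentation.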

Upvotes: 1

chezou

Reputation: 495

Try giving the Java process more memory through java_options, for example java_options="-Xmx4g".
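A minimal sketch of how that option might be passed, assuming the omitted pages are caused by the JVM running out of heap (which this answer does not confirm). heap_option and read_with_more_heap are hypothetical helper names; only read_pdf and its java_options parameter come from tabula-py.

```python
def heap_option(gigabytes):
    """Build the JVM maximum-heap flag, e.g. 4 -> '-Xmx4g'."""
    if gigabytes < 1:
        raise ValueError("heap size must be at least 1 GB")
    return f"-Xmx{gigabytes}g"


def read_with_more_heap(path, gigabytes=4):
    """Read every page of `path`, giving tabula's Java process a larger heap."""
    # Imported lazily so heap_option can be used without tabula installed.
    from tabula import read_pdf

    return read_pdf(path, pages="all", java_options=heap_option(gigabytes))
```

Calling read_with_more_heap("document_name.pdf") would then be equivalent to the read_pdf call in the question with java_options="-Xmx4g" added.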

Upvotes: 0
