Tototulbi

Reputation: 15

Can pdfplumber extract tables for my scanned pdfs?

(I know that pdfplumber is mainly geared towards computer-generated PDFs. However, before I spend a couple of days handtyping data from my scanned PDFs, I thought I'd ask if pdfplumber could somehow help me.)

My problem:
I have scanned PDFs from historical books.
Example: Data from statistical yearbook
Now I'm trying to extract the table (the one in the lower-right in the example) from the scanned PDF.

My first attempts at extracting the table with pdfplumber didn't work.
e.g.

import pdfplumber

with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    print(tables)

returned None

Is there any hope that I will be able to extract this kind of data non-manually? Or should I just suck it up?

Thanks in advance for any help or advice!

Upvotes: 0

Views: 3462

Answers (1)

Alexandru DuDu

Reputation: 1044

No. A scanned PDF typically contains just an image of each page, so there is no text layer for pdfplumber to extract tables from. You can pull the image out as shown below, but that alone will not get you the data. You could get the data with tools that analyze the image (OCR), but that's a different story.

from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

# Walk every page and save each embedded image to its own file.
for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        # extract_to adds a suitable file extension and returns the output path.
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
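
To confirm that a page really is image-only before trying any table extraction, a rough check along these lines might help. It is only a sketch: it relies on the `chars` and `images` lists that a pdfplumber page exposes, the helper names `is_image_only` and `report` are made up for illustration, and the stand-in objects at the bottom are not real pages. A scan that has already been run through OCR may carry a hidden text layer, in which case this check would report it as not image-only.

```python
from types import SimpleNamespace


def is_image_only(page):
    """Return True if the page has no text characters but at least one image.

    `page` is expected to expose `chars` and `images` lists, as a
    pdfplumber page does. (Hypothetical helper, not part of pdfplumber.)
    """
    return len(page.chars) == 0 and len(page.images) > 0


def report(path):
    """Print the result for every page of a PDF; needs pdfplumber installed."""
    import pdfplumber  # imported lazily so the helper above has no dependency

    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            print(number, is_image_only(page))


# Stand-in objects mimicking a scanned page and a computer-generated page.
scanned = SimpleNamespace(chars=[], images=[{"name": "Im0"}])
digital = SimpleNamespace(chars=[{"text": "A"}], images=[])
print(is_image_only(scanned), is_image_only(digital))
```

With pdfplumber installed, calling `report('test.pdf')` would run the same check on each page of a real file.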

Also, this question can help you understand which tools to use, and how, if you really need to get that data.
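
To give a rough idea of what that "different story" involves: an OCR tool (Tesseract, for example) can report each recognised word together with its position on the page, and the table then has to be rebuilt from those positions. The sketch below assumes the words have already arrived as `(text, x, y)` tuples. That input format, the sample values, the `words_to_rows` name and the `row_tolerance` parameter are all assumptions made up for illustration, not the output of any particular library.

```python
def words_to_rows(words, row_tolerance=10):
    """Group (text, x, y) word tuples into rows, each sorted left to right.

    Words whose y differs from the current row's first word by at most
    `row_tolerance` pixels are treated as belonging to the same row.
    """
    rows = []
    row_anchor_y = None
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if row_anchor_y is None or abs(y - row_anchor_y) > row_tolerance:
            rows.append([])  # start a new row
            row_anchor_y = y
        rows[-1].append((x, text))
    # Sort each row by x position and keep only the text.
    return [[text for _, text in sorted(row)] for row in rows]


# Made-up OCR output for a two-column table; the y values wobble slightly,
# the way they do on a scanned page.
sample = [
    ("1901", 40, 102), ("Year", 40, 50), ("Births", 200, 52),
    ("1,234", 200, 100), ("1902", 41, 151), ("1,310", 199, 149),
]
table = words_to_rows(sample)
for row in table:
    print(row)
```

This treats every word as its own cell, so multi-word cells, empty cells and skewed scans would need extra handling. Whether the result is good enough depends mostly on how clean the scan and the OCR output are.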

Upvotes: 0
