Reputation: 15
(I know that pdfplumber is mainly geared towards computer-generated PDFs. However, before I spend a couple of days hand-typing data from my scanned PDFs, I thought I'd ask whether pdfplumber could somehow help me.)
My problem:
I have scanned PDFs from historical books.
Example: Data from statistical yearbook
Now I'm trying to extract the table (the one in the lower-right in the example) from the scanned PDF.
My first attempts at extracting the table with pdfplumber didn't work.
e.g.
import pdfplumber

with pdfplumber.open('test.pdf') as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    print(tables)
returned None
Is there any hope that I will be able to extract this kind of data non-manually? Or should I just suck it up?
Thanks in advance for any help or advice!
Upvotes: 0
Views: 3462
Reputation: 1044
No. A scanned PDF actually just contains an image of each page, so there is no text layer for pdfplumber to read. You can extract the image as shown below, but that alone will not get you the data. You could get the data using tools that analyze the image (OCR), but that's a different story.
from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

# Save every image found on every page to its own file
for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
Also, this question can help you understand which tools to use, and how, if you really need to get that data.
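To give a rough idea of what the "different story" involves: once an OCR tool has turned the page image into words with positions, you still have to reassemble those words into table rows. Below is a minimal sketch of that last step only, not a full solution. It assumes the OCR output has already been reduced to (text, x, y) tuples; the `WordBox` type, the `group_into_rows` helper, the `y_tolerance` value and the sample numbers are all made up for illustration and do not come from pdfplumber, pikepdf or any OCR library.

```python
from typing import List, Tuple

# Hypothetical format: (text, x, y), with x and y the top-left corner in pixels.
WordBox = Tuple[str, float, float]


def group_into_rows(words: List[WordBox], y_tolerance: float = 10.0) -> List[List[str]]:
    """Group word boxes into table rows by vertical position."""
    rows: List[List[WordBox]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Compare with the first word of the current row; if the vertical
        # distance is within the tolerance, the word belongs to that row.
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, order the cells from left to right.
    return [[text for text, _, _ in sorted(row, key=lambda w: w[1])] for row in rows]


# Made-up word boxes standing in for real OCR output (assumption, not real data).
sample = [
    ("1900", 40, 102), ("Year", 40, 50), ("Births", 200, 52),
    ("1234", 200, 100), ("1901", 41, 151), ("1301", 199, 149),
]
table = group_into_rows(sample)
print(table)
```

On real scans the tolerance would need tuning per book, and skewed pages or multi-line cells will break this simple grouping, so expect some manual cleanup afterwards.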
Upvotes: 0