willian12345

Reputation: 23

Error when trying to extract text from PDF using Python PDFMINER

I'm trying to extract text from a PDF using Python's PDFMINER, but when I run the script below, I'm getting the error:

Traceback (most recent call last):
    from pdfminer.high_level import extract_pages
ImportError: cannot import name 'extract_pages' from 'pdfminer.high_level' (C:\Users\Willian Ambisis\AppData\Local\Programs\Python\Python39\lib\site-packages\pdfminer\high_level.py)

Script:

from pdfminer.high_level import extract_text

text = extract_text('report.pdf')
print(text)

I used this answer to build the script: https://stackoverflow.com/a/61857301/16487962

Upvotes: 0

Views: 1469

Answers (1)

SandunAmarathunga

Reputation: 109

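I recommend using Pytesseract and OpenCV instead (simple article). The snippet below renders each PDF page to an image with pdf2image and then runs OCR on it with Pytesseract: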

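import os

from pdf2image import convert_from_path  # needs Poppler installed on the system
import pytesseract  # needs the Tesseract OCR engine installed on the system

filePath = '021-DO-YOU-WONDER-ABOUT-RAIN-SNOW-SLEET-AND-HAIL-Free-Childrens-Book-By-Monkey-Pen.pdf'

# Render every page of the PDF to a PIL image
doc = convert_from_path(filePath)

# Split the path into directory, base name, and extension (not used further below)
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)

# Run OCR on each page image and print the recognized text
for page_number, page_data in enumerate(doc):
    # image_to_string already returns a str, so no .encode() is needed
    txt = pytesseract.image_to_string(page_data)
    print("Page # {} - {}".format(page_number, txt))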
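
If you would rather stay with pdfminer, a quick check like the one below may help narrow down the ImportError. It is only a diagnostic sketch: it assumes the cause is either the older `pdfminer` distribution or an outdated `pdfminer.six` being installed, which I cannot confirm from the traceback alone. The helper names `installed_pdfminer_distributions` and `has_high_level_function` are made up for this example.

```python
from importlib import metadata


def installed_pdfminer_distributions():
    """Return {distribution name: version} for the pdfminer packages that are installed."""
    found = {}
    for name in ("pdfminer", "pdfminer.six"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed; skip it
            pass
    return found


def has_high_level_function(name):
    """Return True if pdfminer.high_level can be imported and exposes `name`."""
    try:
        from pdfminer import high_level
    except ImportError:
        # pdfminer is missing, or this version has no high_level module
        return False
    return hasattr(high_level, name)


# Show which distributions are installed and which helpers they provide
print(installed_pdfminer_distributions())
for func in ("extract_text", "extract_pages"):
    print(func, has_high_level_function(func))
```

If this reports the `pdfminer` distribution, or reports `False` for the function you are importing, then uninstalling `pdfminer` and running `pip install --upgrade pdfminer.six` is a reasonable thing to try.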

Upvotes: 0
