Reputation: 23
I'm trying to extract text from a PDF using Python's pdfminer, but when I run the script below I get this error:
Traceback (most recent call last):
    from pdfminer.high_level import extract_pages
ImportError: cannot import name 'extract_pages' from 'pdfminer.high_level' (C:\Users\Willian Ambisis\AppData\Local\Programs\Python\Python39\lib\site-packages\pdfminer\high_level.py)
Script:
from pdfminer.high_level import extract_text
text = extract_text('report.pdf')
print(text)
I used this answer to build the script: https://stackoverflow.com/a/61857301/16487962
Upvotes: 0
Views: 1469
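A rough way to narrow this down: an ImportError like the one above may mean that the installed distribution is the older `pdfminer` package, or an outdated `pdfminer.six`, whose `pdfminer.high_level` module lacks the newer helper functions. That is an assumption about the environment, not something confirmed in the post. The sketch below only reports which of the two distributions are installed and at what version; `installed_version` is a hypothetical helper name used for illustration.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both distributions install a top-level package named "pdfminer",
# so it helps to know which one (if any) is actually present.
for name in ("pdfminer", "pdfminer.six"):
    print(name, "->", installed_version(name))
```

If only `pdfminer` shows a version, or the `pdfminer.six` version is old, replacing it with a current `pdfminer.six` would be the first thing to try.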
Reputation: 109
I recommend using Pytesseract and OpenCV instead (simple article). The snippet below renders each page of the PDF to an image with pdf2image and then runs OCR on it:
import os
from pdf2image import convert_from_path
import pytesseract

# Path to the PDF to run OCR on
filePath = '021-DO-YOU-WONDER-ABOUT-RAIN-SNOW-SLEET-AND-HAIL-Free-Childrens-Book-By-Monkey-Pen.pdf'
# Render every page of the PDF to a PIL image (pdf2image needs Poppler installed)
doc = convert_from_path(filePath)
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)
for page_number, page_data in enumerate(doc):
    # image_to_string already returns a str, so no .encode() is needed
    txt = pytesseract.image_to_string(page_data)
    print("Page # {} - {}".format(page_number, txt))
Upvotes: 0