Reputation: 21
I am trying to extract text from an image whose type is 'PIL.PpmImagePlugin.PpmImageFile' using pytesseract. The code and the error are below:
from pdf2image import convert_from_path
from PIL import Image
import pytesseract as pyt
pages = convert_from_path('D:/pdf_csv/HealthCare/eRDS - ML/eRDS - ML/2001468/2001468,69,70.pdf', poppler_path='C:/Users/Hp/poppler-0.68.0/bin')
text = pyt.image_to_string(Image.open(pages[0]), lang='eng')
The error I am getting:
AttributeError: 'PpmImageFile' object has no attribute 'read'
Alternatively, is there a way to convert the PpmImageFile to 'jpg' or 'png' format?
Upvotes: 2
Views: 6335
Reputation: 1506
Add fmt='jpeg' or fmt='png' to your function call to get non-PPM images from pdf2image.
In your example, change
pages = convert_from_path('D:/pdf_csv/Health....001468,69,70.pdf',poppler_path='C:/Users/Hp/poppler-0.68.0/bin')
to
pages = convert_from_path('D:/pdf_csv/Health...001468,69,70.pdf', fmt='jpeg', poppler_path='C:/Users/Hp/poppler-0.68.0/bin')
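One more thing worth checking: the AttributeError is raised by Image.open(), which expects a file path or a file-like object, whereas convert_from_path already returns loaded PIL images. Here is a minimal sketch of passing the page straight to pytesseract. The helper name ocr_pdf_page and its parameters are made up for illustration, and it assumes pdf2image, pytesseract, Poppler and Tesseract are installed:

```python
def ocr_pdf_page(pdf_path, page_index=0, poppler_path=None, lang="eng"):
    """Return the OCR text of one page of a PDF (hypothetical helper)."""
    # Imported lazily: defining the helper does not need the third-party
    # packages, but calling it does.
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path returns a list of PIL images, one per page.
    pages = convert_from_path(pdf_path, poppler_path=poppler_path)

    # Pass the PIL image directly; do not wrap it in Image.open().
    return pytesseract.image_to_string(pages[page_index], lang=lang)
```

You would call it as ocr_pdf_page(your_pdf_path, poppler_path=your_poppler_bin). If you still want a PNG or JPEG file on disk, each page is a regular PIL image, so pages[0].save('page0.png') should work as well.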
Upvotes: 4