Chelonian

Reputation: 569

Get image from PowerPoint with Python for OCR

I'm trying to use Python to run OCR with pytesseract on some PowerPoint slides that have images (of text) and I'm stuck on getting the images to pass to pytesseract.

So far I have this, but the last line is the problem:

for slide in presentation.Slides:
    for shape in slide.Shapes:
        if 'Picture' in shape.Name:  # in my case, the images I want have 'Picture' in their name
            picture_text = image_to_string(shape)

This gives an error--I guess because a PowerPoint Shape is not an image:

Traceback (most recent call last):
  File "C:/Users/agent/Desktop/Chaelon Stuff on Desktop/Walpole/make_Word_rough_pass_from_PowerPoint_chapter.py", line 61, in <module>
    worddoc.Content.Text = image_to_string(shape)
  File "C:\Python27\lib\site-packages\pytesseract\pytesseract.py", line 143, in image_to_string
    if len(image.split()) == 4:
  File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 522, in __getattr__
    raise AttributeError("%s.%s" % (self._username_, attr))
AttributeError: <unknown>.split

So then I tried using shape.Image, but got this error:

Traceback (most recent call last):
  File "C:/Users/agent/Desktop/Chaelon Stuff on Desktop/Walpole/make_Word_rough_pass_from_PowerPoint_chapter.py", line 61, in <module>
    worddoc.Content.Text = image_to_string(shape.Image)
  File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 522, in __getattr__
    raise AttributeError("%s.%s" % (self._username_, attr))
AttributeError: <unknown>.Image

Given the image is in the presentation, I was hoping there could be some way to get each image from its Shape object and then pass each image directly to pytesseract for OCR (without having to save it to disk as an image first). Is there?

Or do I have to save it to disk as an image and then read it into pytesseract? If so, how best to do that?

Upvotes: 0

Views: 2798

Answers (2)

scanny

Reputation: 28991

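Picture shapes in python-pptx have an image property, which returns an Image object. (This means opening the .pptx file with python-pptx rather than driving PowerPoint through win32com, as your traceback shows you are doing.)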
http://python-pptx.readthedocs.io/en/latest/api/shapes.html#picture-objects
http://python-pptx.readthedocs.io/en/latest/api/image.html

The image object provides access to the image file bytes and the filename extension (e.g. "png"), which should give you what you need:

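for shape in slide.shapes:  # python-pptx uses lowercase attribute names
    if 'Picture' in shape.name:
        picture = shape
        image = picture.image
        image_file_bytes = image.blob  # raw bytes of the embedded image file
        file_extension = image.ext     # e.g. "png"
        # save the image as a file, or wrap the bytes in an in-memory file
        # such as io.BytesIO() (StringIO() on Python 2), using bytes and ext.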
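
To make the in-memory route concrete, here is a minimal sketch, assuming the presentation is opened with python-pptx and that Pillow and pytesseract are installed. The function names iter_picture_blobs and ocr_presentation are made up for illustration, and the 'Picture' name filter is simply carried over from the question; I have not run this against your file.

```python
import io


def iter_picture_blobs(shapes, name_hint="Picture"):
    """Yield (image_bytes, extension) for shapes whose name contains name_hint."""
    for shape in shapes:
        # Only picture shapes expose .image; other shape types are skipped.
        if name_hint in shape.name and hasattr(shape, "image"):
            image = shape.image  # python-pptx Image object
            yield image.blob, image.ext


def ocr_presentation(path):
    """Return a list of OCR strings, one per matching picture in the .pptx at path."""
    # Third-party imports are kept local so the helper above has no dependencies.
    from PIL import Image
    from pptx import Presentation
    import pytesseract

    texts = []
    for slide in Presentation(path).slides:
        for blob, _ext in iter_picture_blobs(slide.shapes):
            pil_image = Image.open(io.BytesIO(blob))  # no temporary file on disk
            texts.append(pytesseract.image_to_string(pil_image))
    return texts
```

Wrapping the bytes in io.BytesIO gives Pillow a file-like object to read from, so pytesseract receives a real PIL image rather than a PowerPoint shape.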

Upvotes: 1

user7711283

Reputation:

You have answered your own question, but either aren't sure you're right or don't want to believe that's the way it is. Yes:

You need to save the image to disk as an image file and then read it into pytesseract, unless you find a way to convert the image you got from PowerPoint into an image object of the kind used by PIL (Python Imaging Library).

Maybe someone else can explain how to convert a PowerPoint image into a PIL image. I am not on Windows and don't use Microsoft PowerPoint, so I can't test any proposed solution myself, but this link may already provide enough information for your needs:

https://codereview.stackexchange.com/questions/101803/process-powerpoint-xml
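
If you do go through the disk, something along these lines might work. It is only a sketch: it assumes you have already obtained the raw image bytes and the file extension by some other means, that Pillow and pytesseract are installed, and the function names are made up for illustration. As said above, I could not test the PowerPoint side of this.

```python
import os
import tempfile


def save_blob_to_temp_file(blob, ext):
    """Write image bytes to a temporary file and return the file's path."""
    fd, path = tempfile.mkstemp(suffix="." + ext)
    with os.fdopen(fd, "wb") as handle:
        handle.write(blob)
    return path


def ocr_via_disk(blob, ext):
    """OCR image bytes by round-tripping them through a temporary file."""
    # Third-party imports are local so the helper above needs only the stdlib.
    from PIL import Image
    import pytesseract

    path = save_blob_to_temp_file(blob, ext)
    try:
        with Image.open(path) as pil_image:
            return pytesseract.image_to_string(pil_image)
    finally:
        os.remove(path)  # clean up the temporary file afterwards
```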

Upvotes: 1
