Reputation: 11
I have many .pptx files in a directory and want to find out which of them contain the word "data". I wrote the code below, which reads all the files, but it does not give the correct True/False result. For example, in Person1.pptx the word "data" exists in two shapes, yet every check prints False. Where exactly is the mistake, and why does the code give incorrect results?
from pptx import Presentation
import os
files = [x for x in os.listdir("C:/Users/../Desktop/Test") if x.endswith(".pptx")]
for eachfile in files:
    prs = Presentation("C:/Users/.../Desktop/Test/" + eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            print("Exist? " + str(hasattr(shape, 'data')))
The result is as below:
Person1.pptx
----------------------
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Person2.pptx
----------------------
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
Exist? False
The expected result is to find the word "data" in one of the slides and print True. That is, the expected output would be:
Person1.pptx
----------------------
Exist? True
Person2.pptx
----------------------
Exist? False
True if the word exists in any of the shapes on the slides, and False if none of the shapes contain it.
Upvotes: 0
Views: 2019
Reputation: 167
Answering this since the other answer might mislead more people than just me. It is not wrong, but it is incomplete, and in many real-life cases it will deliver the wrong result.
The issue is that there are several structures to parse, and the code in that answer parses only some of them (the shapes that contain text directly). The most important structure that also needs to be parsed to find all shapes with the wanted text is the group. A group is a shape that may not contain the text itself but may contain other shapes that do.
Also, a group may in turn contain other groups. This leads to the need for a recursive search strategy, so a different approach is needed when parsing the shapes in each slide. This is best shown by reusing the code from that answer, keeping the first part:
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE
import os
files = [x for x in os.listdir("C:/Users/.../Desktop/Test") if x.endswith(".pptx")]
for eachfile in files:
    prs = Presentation("C:/Users/.../Desktop/Test/" + eachfile)
    for slide in prs.slides:
Then we need to replace the "hasattr" test with a call to the recursive function:
        checkrecursivelyfortext(slide.shapes)
and also insert the definition of the new recursive function (for example, right after the import statements). To make comparison easier, the function uses the same code as the other answer, only adding the recursive part:
def checkrecursivelyfortext(shpthissetofshapes):
    for shape in shpthissetofshapes:
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            # A group has its own shapes collection, so search it as well.
            checkrecursivelyfortext(shape.shapes)
        else:
            if hasattr(shape, "text"):
                # Lower-case a copy so the shape itself is not modified.
                text = shape.text.lower()
                if "whatever_you_are_looking_for" in text:
                    print(eachfile)  # eachfile is the loop variable of the calling code
                    print("----------------------")
                    break
To work exactly as wanted, the break needs to be handled differently (it has to break out of all ongoing loops, including those of the calling recursion levels). This would complicate the code a bit and take the focus away from the parsing of groups, so it is ignored here.
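One way to sidestep the break problem is to let the recursive function return a boolean instead of printing, so the caller decides what to print once per file. The following is only a minimal sketch and has not been run against real presentations: the name `contains_text` is made up for illustration, the `SimpleNamespace` objects are stand-ins for python-pptx shapes, and it assumes that only group shapes expose a `shapes` attribute.

```python
from types import SimpleNamespace


def contains_text(shapes, needle):
    """Return True if any shape, at any nesting depth, contains needle."""
    needle = needle.lower()
    for shape in shapes:
        # Assumption: only group shapes have a .shapes collection.
        if hasattr(shape, "shapes"):
            if contains_text(shape.shapes, needle):
                return True  # returning ends every enclosing loop at once
        elif hasattr(shape, "text") and needle in shape.text.lower():
            return True
    return False


# Hypothetical stand-ins for shapes, so the sketch runs without a .pptx file.
title = SimpleNamespace(text="Quarterly report")
picture = SimpleNamespace()  # no text attribute, like a picture shape
inner_group = SimpleNamespace(shapes=[SimpleNamespace(text="Raw DATA here")])
outer_group = SimpleNamespace(shapes=[picture, inner_group])

print("Exist? " + str(contains_text([title, outer_group], "data")))
print("Exist? " + str(contains_text([title, picture], "data")))
```

With python-pptx, the per-file check would then presumably look like `any(contains_text(slide.shapes, "data") for slide in prs.slides)`, printed once after the file name.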
Upvotes: 1
Reputation: 11
I found it by myself. :)
from pptx import Presentation
import os
files = [x for x in os.listdir("C:/Users/.../Desktop/Test") if x.endswith(".pptx")]
for eachfile in files:
    prs = Presentation("C:/Users/.../Desktop/Test/" + eachfile)
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                # Lower-case a copy so the shape itself is not modified.
                text = shape.text.lower()
                # The search term must be lower case as well.
                if "whatever_you_are_looking_for" in text:
                    print(eachfile)
                    print("----------------------")
                    break  # only leaves the loop over this slide's shapes
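For anyone wondering why the version in the question always printed False: `hasattr(shape, 'data')` asks whether the object has an attribute named `data`; it never looks at the text inside the shape. A small sketch of the difference, using a made-up `FakeShape` class rather than a real python-pptx shape:

```python
class FakeShape:
    """Hypothetical stand-in for a shape; not part of python-pptx."""

    def __init__(self, text):
        self.text = text


shape = FakeShape("Sales data overview")

# hasattr looks for an attribute called "data", which does not exist.
attribute_check = hasattr(shape, "data")

# A substring test on the text attribute is what the question needs.
substring_check = hasattr(shape, "text") and "data" in shape.text.lower()

print(attribute_check)
print(substring_check)
```

The first check is False no matter what the shape says, while the second one depends on the actual text.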
Upvotes: 1