Speed up the search of substring inside big string

Question

I have written the following Python snippet, which checks whether a specific substring exists inside a string. Because the loop runs around 1000 times, it takes around 5-7 seconds to complete.

for style in all_available_gs_styles:
    if style.sld_title is not None:
        if str(style.sld_title) not in ('line', 'point', 'polygon', 'Polygon', 'Default Line', 'Default Point'):
            if 'PolygonSymbolizer' in style.sld_body and layer_geom == 'polygon':
                gs_styles.append((style.name, style.sld_title))
            elif 'LineSymbolizer' in style.sld_body and layer_geom == 'line':
                gs_styles.append((style.name, style.sld_title))
            elif 'PointSymbolizer' in style.sld_body and layer_geom == 'point':
                gs_styles.append((style.name, style.sld_title))

I was wondering if there is a more efficient way to search for a string inside a text which is around 50 lines long. What would be a quicker approach?

EDIT: Following the accepted answer, the execution time dropped to 4-5 seconds. That is still not fast enough, but it is better than before.

Ma0 · Accepted Answer

I would go with something more compact but still fairly readable like this:

geoms   = ('line', 'point', 'polygon')  # see EDIT
invalid = {'line', 'point', 'polygon', 'Polygon', 'Default Line', 'Default Point'}
for style in all_available_gs_styles:
    if style.sld_title and str(style.sld_title) not in invalid:
        if any(layer_geom == x and '{}Symbolizer'.format(x.capitalize()) in style.sld_body for x in geoms):
            gs_styles.append((style.name, style.sld_title))

Note that the gains are only conditional. For example, doing the equality check first because it is cheaper is the right way to go, but it only helps in the cases where it returns False.
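
To make the short-circuit point concrete, here is a minimal sketch that counts how often the body is actually searched under each ordering. CountingBody is a hypothetical stand-in for style.sld_body, and the sample XML fragment is made up for illustration.

```python
class CountingBody:
    """Hypothetical stand-in for style.sld_body that counts substring searches."""

    def __init__(self, text):
        self.text = text
        self.searches = 0

    def __contains__(self, needle):
        # Called by `needle in body`; record the search, then delegate to str.
        self.searches += 1
        return needle in self.text


geoms = ('line', 'point', 'polygon')
layer_geom = 'point'
body = CountingBody('<sld><PointSymbolizer/></sld>')  # made-up sample body

# Equality first: `and` short-circuits, so the body is searched only when x == layer_geom.
cheap_first = any(layer_geom == x and '{}Symbolizer'.format(x.capitalize()) in body
                  for x in geoms)
searches_cheap_first = body.searches

body.searches = 0
# Search first: the body is scanned for every x until any() finds a match.
search_first = any('{}Symbolizer'.format(x.capitalize()) in body and layer_geom == x
                   for x in geoms)
searches_search_first = body.searches

print(searches_cheap_first, searches_search_first)
```

Both orderings give the same answer; the difference is only in how many times the expensive in check runs.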

The bottleneck in your code (and mine) is the in checks (e.g. if 'PolygonSymbolizer' in style.sld_body), but without knowing the data you are working with I cannot help any further.
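
Since layer_geom does not change inside the loop, one possible refinement is to pick the symbolizer name once and run a single in check per style. This is only a sketch: Style, SYMBOLIZERS, INVALID_TITLES and matching_styles are hypothetical names, and the namedtuple merely imitates the attributes used in the question.

```python
from collections import namedtuple

# Hypothetical stand-in for the style objects in the question.
Style = namedtuple('Style', ['name', 'sld_title', 'sld_body'])

INVALID_TITLES = {'line', 'point', 'polygon', 'Polygon', 'Default Line', 'Default Point'}
SYMBOLIZERS = {
    'line': 'LineSymbolizer',
    'point': 'PointSymbolizer',
    'polygon': 'PolygonSymbolizer',
}


def matching_styles(styles, layer_geom):
    """Return (name, title) pairs for styles whose body has the symbolizer for layer_geom."""
    needle = SYMBOLIZERS.get(layer_geom)  # resolved once, outside the loop
    if needle is None:
        return []  # unknown geometry type: nothing can match
    return [
        (s.name, s.sld_title)
        for s in styles
        if s.sld_title is not None
        and str(s.sld_title) not in INVALID_TITLES
        and needle in s.sld_body  # the only substring search per style
    ]


# Made-up sample data for illustration.
sample = [
    Style('roads', 'Main roads', '<LineSymbolizer/>'),
    Style('parks', 'City parks', '<PolygonSymbolizer/>'),
    Style('default', 'polygon', '<PolygonSymbolizer/>'),
    Style('untitled', None, '<PolygonSymbolizer/>'),
]
gs_styles = matching_styles(sample, 'polygon')
print(gs_styles)
```

This still performs one substring search per style that passes the title filter, so it trims overhead rather than removing the bottleneck.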

EDIT

Using Euler's formula for polyhedra, we can assume that for every polygon the number of lines (E) is greater than the number of vertices (V), so lines are the most frequent entity in style.sld_body. We can take advantage of that to make any() short-circuit more often by arranging the geoms tuple as geoms = ('line', 'point', 'polygon'). This will of course not have a significant impact, but it is the best we can do.
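
Because reordering geoms is not expected to change much, it may be worth measuring where the time actually goes. The sketch below times reading sld_body separately from the substring search. It rests on an assumption the question does not confirm: that reading the attribute could itself be slow, for example if it is computed or loaded lazily. SlowStyle and its artificial delay are hypothetical and exist only to make the sketch runnable.

```python
import time


class SlowStyle:
    """Hypothetical style whose sld_body is expensive to read (simulated with sleep)."""

    def __init__(self, body):
        self._body = body

    @property
    def sld_body(self):
        time.sleep(0.002)  # artificial delay standing in for a lazy load
        return self._body


def time_access_and_search(styles, needle):
    """Return total seconds spent reading sld_body and searching it, respectively."""
    access = search = 0.0
    for s in styles:
        t0 = time.perf_counter()
        body = s.sld_body        # attribute access only
        t1 = time.perf_counter()
        needle in body           # substring search only; result is not needed here
        t2 = time.perf_counter()
        access += t1 - t0
        search += t2 - t1
    return access, search


styles = [SlowStyle('<PolygonSymbolizer/>\n' * 50) for _ in range(5)]
access_s, search_s = time_access_and_search(styles, 'LineSymbolizer')
print('access: {:.4f}s, search: {:.6f}s'.format(access_s, search_s))
```

If the access time dominates on the real objects, fetching or caching the bodies once would matter more than any change to the in checks.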
