jukhamil
jukhamil

Reputation: 819

How to iterate with index elegantly in Python, for string-matching a list of substrings inside text?

Text is a large string, patterns is a list of short (fixed) strings. My code is:

def BruteForcePatternMatching(text, patterns):
    indices = []
    for pattern in patterns:
        for index in range(len(text) - 1):
            slide = text[index : index + len(pattern) - 1]
            if pattern == slide:
                indices.append(index)
    return indices

My question is if there is a 'pythonic' way to extract a 'slide' from text, of iterated size.

Upvotes: 0

Views: 259

Answers (2)

smci
smci

Reputation: 33950

You're trying to find matches to a list, patterns, inside your text.

If patterns is just fixed strings, use string.find() (although note that will only find the first occurrence - see below).

s = 'cat dog cow dog' # Note 'dog' occurs multiply
s.find('dog')
4
s.find('cow')
8
s.find('cat')
0

More generally, if patterns either has regexes, or has duplicate fixed strings, use re.findall/iter(). See the many duplicate questions here for examples.

import re
pat = re.compile(r'(cat|dog|cow)')
pat.findall("The cat and cow sat on the dog's catalog of doggerel")
# ['cat', 'cow', 'dog', 'cat', 'dog']

And if you also need the indices where matches occur, use re.finditer() as @khachik showed.

Note we combined all the patterns into one regex, then used one re.findall() call to search for all matches of every pattern inside all of text. Your current code is very inefficient: it's trying to test for string-equality by running a sliding-window over text, and has nested loops: one covers all N indices in text, and all P patterns; this will be order O(N*P), which is not scalable.

Upvotes: 2

khachik
khachik

Reputation: 28693

Assuming your patterns don't contain any special regex characters or are regexes themselves:

import re
indices = [(m.start(), m.end()) for m in re.finditer("|".join(patterns), text)]

Upvotes: 1

Related Questions