Reputation: 8069

Faster way to print all starting indices of a substring in a string, including overlapping occurences

I'm trying to answer this homework question: Find all occurrences of a pattern in a string. Different occurrences of a substring can overlap with each other.

Sample 1.

Input:

TACG

GT

Output:

Explanation: The pattern is longer than the text and hence has no occurrences in the text.

Sample 2.

Input:

ATA

ATATA

Output:

0 2

Explanation: The pattern appears at positions 1 and 3 (and these two occurrences overlap each other).

Sample 3.

ATAT

GATATATGCATATACTT

Output:

1 3 9

Explanation: The pattern appears at positions 1, 3, and 9 in the text.

The answer I'm submitting is this one:

def all_indices(text, pattern):
    i = text.find(pattern)
    while i >= 0:
        print(i, end=' ')
        i = text.find(pattern, i + 1)


if __name__ == '__main__':
    text = input()
    pattern = input()
    all_indices(text, pattern)

However, this code is failing the final test cases:

Failed case #63/64: time limit exceeded (Time used: 7.98/4.00, memory used: 77647872/536870912.)

The online judge knows I'm sending the answer in Python, and has different time limits for different languages.

I have searched quite a bit for other answers and approaches: regexes, suffix trees, Aho-Corasick... but so far all of them underperform this simple solution (maybe because find is implemented in C?).

So my question is: are there ways to do this task faster?

Upvotes: 0

Answers (2)

Marcus Vinícius Monteiro

Reputation: 8069

I believe that the test cases were being more lenient towards the Knuth-Morris-Pratt algorithm. This code, copied from https://en.wikibooks.org/wiki/Algorithm_Implementation/String_searching/Knuth-Morris-Pratt_pattern_matcher#Python, passed all the cases:

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

#from http://code.activestate.com/recipes/117214/
def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
    Calling conventions are similar to string.find, but its arguments can be
    lists or iterators, not just strings, it returns all matches, not just
    the first one, and it does not need the whole text in memory at once.
    Whenever it yields, it will have read the text exactly up to and including
    the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

Upvotes: 0

silel

Reputation: 587

If print is what slows your program the most, you should try to call it as little as possible. A quick and dirty solution to your problem:

def all_indices(string, pattern):
    result = []
    idx = string.find(pattern)
    while idx >= 0:
        result.append(str(idx))
        idx = string.find(pattern, idx + 1)
    return result

if __name__ == '__main__':
    string = input()
    pattern = input()
    ' '.join(all_indices(string, pattern))

In the future to correctly identify which part of your code is slowing down the whole process you can use the python profilers

Upvotes: 1

Faster way to print all starting indices of a substring in a string, including overlapping occurences

Answers (2)

Related Questions