How can I optimize this text-matching function further?

Question

I need to make this function run much faster (~20x faster) to meet a required benchmark. I have made quite a few improvements from my initial implementation, but am hitting a wall.

The basic problem is this: count case-insensitive occurrences of word in text.

Complicating criteria include:

Must be a whole word (word "George" is not found in text "Georges")
Single quotes are to be considered part of words, unless there are more than one in a row
word may actually be a phrase (meaning it could have spaces, punctuation, etc.)
Cannot use regex

My basic implementation has been to loop through each character in text, maintaining my position in word and if the character matches the corresponding character of word, I add it to a local string, advance my position in word and text, and go again. Once I have a match candidate (i.e. my local string is equal to word), I check the surrounding characters to ensure the match candidate is a full word, per rules 1 and 2 above. Note this checking does not occur frequently enough to materially impact the total time the algorithm takes.

Most successful optimizations I have made so far:

Do string lowercasing and length measuring outside the loop
Check that word is at least a substring of text otherwise return 0 immediately
Don't bother checking for full word potential until we have a full match
Count the number of occurrences up front (with no rules), and break out of the loop immediately if we meet that number

I've profiled the code line-by-line using pprofile, and the majority of my code's runtime is simple lines like incrementing the counter var, resetting the match_candidate string to "", indexing into strings, and if statements. I have not included the code for validate_full_match as it is not a significant time user.

Are there any low-hanging fruit I'm ignoring? A different approach entirely I should consider?

Thanks for any suggestions!

def count_occurences_in_text(word, text):
    """Number of occurences of word (case insensitive) in text

    Note that word can actually be any length of text, from a single
    character to a complete phrase; however, partial words do not
    count. For example:
    count_occurences_in_text("george", "I am Georges") returns 0
    while
    count_occurences_in_text("i am", "I am Georges") returns 1
    """
    # We perform some measurements and manipulation at the start to
    # avoid performing them repeatedly in the loop below
    text = text.lower()
    word = word.lower()
    max_matches = text.count(word)
    if max_matches == 0:
        return 0
    word_len = len(word)
    # Counter vars
    match_count = 0
    text_cursor = 0
    word_cursor = 0
    # We will build up match_candidate and check it against word
    match_candidate = ""
    for text_char in text:
        if text_char == word[word_cursor]:
            match_candidate += text_char
            if word == match_candidate:
                if validate_full_match(text, text_cursor, word_len):
                    match_count += 1
                    if match_count == max_matches:
                        break
                    word_cursor = 0
                    match_candidate = ""
            else:
                word_cursor += 1
        else:
            match_candidate = ""
            word_cursor = 0
        text_cursor += 1
    return match_count

How can I optimize this text-matching function further?

Answers (1)

Related Questions