Greg Kaleka
Greg Kaleka

Reputation: 2000

How can I optimize this text-matching function further?

I need to make this function run much faster (~20x faster) to meet a required benchmark. I have made quite a few improvements from my initial implementation, but am hitting a wall.

The basic problem is this: count case-insensitive occurrences of word in text.

Complicating criteria include:

  1. Must be a whole word (word "George" is not found in text "Georges")
  2. Single quotes are to be considered part of words, unless there are more than one in a row
  3. word may actually be a phrase (meaning it could have spaces, punctuation, etc.)
  4. Cannot use regex

My basic implementation has been to loop through each character in text, maintaining my position in word and if the character matches the corresponding character of word, I add it to a local string, advance my position in word and text, and go again. Once I have a match candidate (i.e. my local string is equal to word), I check the surrounding characters to ensure the match candidate is a full word, per rules 1 and 2 above. Note this checking does not occur frequently enough to materially impact the total time the algorithm takes.

Most successful optimizations I have made so far:

I've profiled the code line-by-line using pprofile, and the majority of my code's runtime is simple lines like incrementing the counter var, resetting the match_candidate string to "", indexing into strings, and if statements. I have not included the code for validate_full_match as it is not a significant time user.

Are there any low-hanging fruit I'm ignoring? A different approach entirely I should consider?

Thanks for any suggestions!

def count_occurences_in_text(word, text):
    """Number of occurences of word (case insensitive) in text

    Note that word can actually be any length of text, from a single
    character to a complete phrase; however, partial words do not
    count. For example:
    count_occurences_in_text("george", "I am Georges") returns 0
    while
    count_occurences_in_text("i am", "I am Georges") returns 1
    """
    # We perform some measurements and manipulation at the start to
    # avoid performing them repeatedly in the loop below
    text = text.lower()
    word = word.lower()
    max_matches = text.count(word)
    if max_matches == 0:
        return 0
    word_len = len(word)
    # Counter vars
    match_count = 0
    text_cursor = 0
    word_cursor = 0
    # We will build up match_candidate and check it against word
    match_candidate = ""
    for text_char in text:
        if text_char == word[word_cursor]:
            match_candidate += text_char
            if word == match_candidate:
                if validate_full_match(text, text_cursor, word_len):
                    match_count += 1
                    if match_count == max_matches:
                        break
                    word_cursor = 0
                    match_candidate = ""
            else:
                word_cursor += 1
        else:
            match_candidate = ""
            word_cursor = 0
        text_cursor += 1
    return match_count

Upvotes: 1

Views: 90

Answers (1)

Msk
Msk

Reputation: 857

  1. Python strings are immutable, every time you perform a match_candidate += text_char you are effectively making a new string and copying all contents of previous version of match_candidate to it. Let's say your word is 'helloworld'. When there is chance of matching with 'helloworl' in text, you perform (len(word)^2) operations here. You can surely avoid that by maintaining an index. This can save a lot of operations.
  2. max_matches = text.count(word), you can avoid this by checking if you have reached the end of the text. This function initially will cost you O(len(text)) that you can avoid.
  3. validate_full_match what is checked in this function. You can avoid this if this by doing proper steps when comparing individual characters.

Python is easy to code and has amazing inbuilt functions and constructs. But to optimize, you need to ensure that you keep a track of complexity of every line.

Upvotes: 3

Related Questions