user1502381
user1502381

Reputation: 145

Reduce repeated string search from quadratic to faster

Apologies if the title was not accurate enough of a description.

I have a program that searches a string for vowels, but it does it over sliding windows of specific sizes across the string. I want to compute the density of vowels. Pretend that I don't care about there being whitespace and symbols; it's just left there. The obvious solution I have coded as follows:

s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit ... consequat."
# assume s continues after the "..."
v = "aeiou"
v_dict = {}
max_win_size = 100
if max_win_size > len(s): max_win_size = len(s)

for i in range(1, max_win_size):
    v_dict[i] = {}
    for j in range(0, i+1):
        v_dict[i][j] = 0 # set counts to zero
    for j in range(1, len(s)-i+1):
        s_slice = s[j:j+i].lower()
        v_count = sum([s_slice.count(c) for c in v])
        v_dict[i][v_count] += 1

This gives eventually gives me the frequency of vowels at different sliding windows from 1 to size of the string. This program works as I want it to, but the problem is that it is very slow. It's clearly quadratic, and I would like to improve the efficiency, because as text gets larger the program takes exponentially more time. The problem is that I am unsure how to transform the problem space into a more efficient algorithm.

Does anyone have any ideas as to how to make this, say, loglinear?

Edit:

I tried the answer by Ben Voigt implemented by cr1msonB1ade. It worked as advertised. In addition, I thought I would include empirical evidence to the speed claims.

First is the run time for increasing string size. Both functions perform linearly, however the overhead of using pure python to implement it results in a significantly greater coefficient on the linear run time. Run time for increasing string size

Second is the run time for increasing window size. I fixed the window size and ran larger and larger window sizes. This time, the quadratic rate of my function revealed itself, while the numpy cumulative sum function remained linear.

enter image description here

Upvotes: 1

Views: 89

Answers (2)

cr1msonB1ade
cr1msonB1ade

Reputation: 1716

In terms of theoretical run time you are calculating max_win_size * (len(s)-max_win_size+1) values, so there is no way to get below o(max_win_size*len(s)) run time.

In terms of actual computational run time there are a couple issues with your current code. First there is no reason to do letter matching on each string. You should first convert your string into a TRUE FALSE list and then query that instead.

Second you can dynamically update the sums when moving your sliding window. In other words you only have to query the letter you are dropping and the one you are adding to get the number of vowels in the next substring.

Also you don't really need a dictionary for the second level of your data structure. You know exactly how long the list is and you will be indexing with integer values, so just use a list.

I'm sure there may be some other efficiencies that could be added, but putting these together we have:

s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit ... consequat."
# assume s continues after the "..."
v = list("aeiou")
isVowel = [l in v for l in s]
v_dict = {}
max_win_size = 100
if max_win_size > len(s): max_win_size = len(s)

for i in range(1, max_win_size):
    v_dict[i] = [0,] * (i+1)
    curr_vowels = sum(isVowel[:i])
    v_dict[i][curr_vowels]  += 1
    for curr_pos in range(1, len(s)-i):
        # add next letter position and subtract letter falling out of window
        curr_vowels += isVowel[curr_pos + i] - isVowel[curr_pos - 1]
        v_dict[i][curr_vowels] += 1

EDIT: Using cumulative method as @Ben recommends:

import numpy as np
s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit ... consequat."
# assume s continues after the "..."
v = list("aeiou")
cumVowels = [0,] + np.cumsum([l in v for l in s])
v_dict = {}
max_win_size = 100
if max_win_size > len(s): max_win_size = len(s)

for i in range(1, max_win_size):
    v_dict[i] = [0,] * (i+1)
    for pos in range(0, len(s)-i):
        v_dict[i][cumVowels[pos + i] - cumVowels[pos]] += 1

Upvotes: 2

Ben Voigt
Ben Voigt

Reputation: 283733

Rewrite the algorithm based on cumulative counts.

The count of occurrences between position i and j (inclusive) is just cumul_count(j) - cumul_count(i-1), and the computation necessary doesn't increase with the window size.

Upvotes: 3

Related Questions