Algorithm optimizing

Question

The task is following:

Find and calculate the sum of all the possible substrings within a given input master string s, out of which you can form the word "tira" (Course abbr.) by further removing unnecessary letters.

EXAMPLE, return value 11 with input "tixratiyra": 1: tixratiyra, 2: tixratiyra, 3: tixratiyra, 4: tixratiyra, 5: tixratiyra, 6: tixratiyra, 7: tixratiyra, 8: tixratiyra, 9: tixratiyra, 10: tixratiyra, 11: tixratiyra.

I'm able to create a working piece of code but it won't run fast enough, it should be able to perform this task in O(n) time with a maximum input length of 10^5.

My code, working painfully slow:

def count(s):
    start = timeit.default_timer()

    c = "bcdefghjklmnopqsuvwxyz"
    last_char = ""
    indexes = set()
    unique_indexes = []

    last_A = s.rfind("a")
    last_R = s.rfind("r", 0, last_A)
    last_I = s.rfind("i", 0, last_R)
    last_T = s.rfind("t", 0, last_I)

    unique_tiras = ""

    for i in range(len(s)):
        char = s[i]
        if char not in c:
            if char == "t":
                if i <= last_T:
                    indexes.add("t")
                    last_char = "t"
                    unique_tiras += str(i) + "t"
            
            elif char == "i" and last_char != "i":
                if i <= last_I and "t" in indexes:
                    indexes.add("i")
                    last_char = "i"
                    unique_tiras = unique_tiras.replace("t", "i")

            elif char == "r" and last_char != "r":
                if i <= last_R and ("t" and "i") in indexes:
                    indexes.add("r")
                    last_char = "r"
                    unique_tiras = unique_tiras.replace("i", "r")

            elif char == "a":
                if i <= last_A and ("t" and "i" and "r") in indexes:
                    last_char = "a"
                    unique_tiras = unique_tiras.replace("r", f"-{i};")
                    pairs = unique_tiras.split(";")
                    unique_tiras = ""

                    for elements in pairs:
                        if "-" in elements:
                            Tindex = elements.split("-")
                            unique_indexes.append((int(Tindex[0]), int(Tindex[1])))
                            unique_tiras += Tindex[0] + "r"
                        
                        else:
                            unique_tiras += elements

    if len(unique_indexes) < 1:
        print("found no tira substrings with input '", s[0:20], "'")
        print("indexing took a total of", timeit.default_timer()-start, "s")

        return 0
    
    print("found a total of", len(unique_indexes), "tira substrings with input '", s[0:20], "'") #, which are the following:
    #print(unique_indexes)
    print("indexing took a total of", timeit.default_timer()-start, "s")

    start = timeit.default_timer()

    unique_substrings = set()

    for tiras in unique_indexes:
        begin = 0

        while begin <= tiras[0]:
            end = tiras[1]

            while end <= len(s) - 1:
                unique_substrings.add((begin, end))
                end += 1
            
            begin += 1

    print("calculating suitable substrings took a total of", timeit.default_timer()-start, "s")
    print("found suitable substrings a total of")

    return len(unique_substrings)

if __name__ == "__main__":
    print(count("ritari")) # 0
    print(count("taikurinhattu")) # 4
    print(count("ttiirraa")) # 4
    print(count("tixratiyra")) # 11 
    print(count("aotiatraorirratap")) # 42

chrslg · Accepted Answer

The answer itself is not O(n). More like O(n²). For example, for a string "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxtiraxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

answer is that you can choose 40 different starts for your substring, and 40 differents ends (there are 39 x. So you can decide to include 0 to 39 x as prefix, and 0 to 39 x as suffix). Which leads to 1600 possible substrings.

With "tira" in the middle, that is roughly (n/2)²

My point here is not to say "it is not possible". It is possible. I am about to do it :D.

My point is that it is not possible by enumerating all possibilities. Because if you are just trying lot of substrings, and counting those that are valid, then you have to try at least all working solutions (plus the non working one). And even assuming you can check an hypothesis in O(1) (which you probably can't), that means that the computation time as to be at least the same big-O as the result.

So we have to evaluate the number of working substring, without actually enumerating them and checking them

Here is my shot at it

def minIndex(s, sub):
    if not sub[0] in s:
        return None,None
    i0=s.index(sub[0])
    ix=i0
    for c in sub[1:]:
        ss=s[ix+1:]
        if not c in ss:
            return None, None
        ix=ss.index(c)+ix+1
    return i0, ix


def mycount(s, sub):
    tot=0
    while True:
        a,b=minIndex(s, sub)
        if a is None:
            return tot
        tot+=(a+1)*(len(s)-b)
        s=s[a+1:]
    return tot

def count(s):
    return mycount(s, "tira")

if __name__ == "__main__":
    print(count("ritari")) # 0
    print(count("taikurinhattu")) # 4
    print(count("ttiirraa")) # 4
    print(count("tixratiyra")) # 11 
    print(count("aotiatraorirratap")) # 42

What I do here, is some sort of greedy algorithm (but one that is exact). So, I search for the minimum index of the 't' in the string, and the minimum index of the 'a' that follows the 't' (with the other letters in between).

And that gives me, using the previous reasoning on my "xxxxxtiraxxxxx" example, a number of combinations.

Note that this is done in O(b), b being the said index.

Then, because of the possibility that there are several 't', each of them associated with potentially different ending 'a', I recompute the same thing, from all possible starting point after that 1st 't'. Then after the 3rd. ... Until there is no answer.

For example, for s='xxxxxtitrairaxxxxx', minIndex(s, 'tira') is (5,9). Which means that, to have a 'tira' starting with the first t, we have 6 possibles starting point (0,1,2,3,4,5). And for each starting point, we have len(s)-9=9 ending point (which makes sense: after the first possible ending a, we have iraxxxxx, that is 8 letters. We can include 0 to 8 of them as suffix of the substring).

In addition to those 6*9=54 possibilities that starts includes the first t, we must consider those that do not. So, we just have to do the same reasoning with s='itrairaxxxxx'.

We have 1 letter before the first t. And the first possible a (which is not the same as before: the first a was a possible ending sooner. But starting from the second t it not longer is, and any working substring must also includes the second a now), is followed by 5 x. So, that is 2 possibles prefixes (i or nothing), and 6 possibles suffixes (``, x, xx, xxx, xxxx, xxxxx).. So 12.

Again, once that done, we try what we could to without that 2nd starting t. So we retry with 'rairaxxxxx'. It gives nothing. End of the computation.

Final answer is 54+12 = 66

Each stage of the computation iterates partially the whole string. Altogether it is O(n).

Well, not exactly O(n), in some worst case scenario. But that is because I was a bit lazy.

For example with string tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttira, as is, my algorithm is O(n²) : each t is an iteration in the mycount while loop. And minIndex, because the only a is at the very end, must iterates the whole string to find its index. So in such case, it is O(n²) (even if it feels like O(n), because that inner O(n) is inside the .index() code, which is internal, probably C, code, so very very fast, compare the the rest of pure python code).

But that is avoidable: I could have memorized the position of the first i, r and a. And recompute position of a index only when we passed that first i while skipping t (and even when we do, also check that we passed the first r while searching another i, etc.).

And doing so, it would ensure that it is really O(n) (because either you need to update the first a all the time, but then that means that it is near the first t, and the overall algorithm is O(n) — O(n×len('tira')) to be accurate ; or you don't, and then computation of that .index('a') is done only a few times, and algorithm is also O(n)).

Anyway, pragmatically, that algorithm is way faster than your version, and do not need to enumerate and verify possible solutions to count them. For example for "x"*1000 + "tira" + "x"*1000 your code takes 0.54 seconds, while this one takes 4.5 μs.

For a disaster test (worst case scenario for my algorithm) s="t"*10000+"ira", your code takes 31.7 seconds, mine 35 ms.

Algorithm optimizing

Answers (1)

Related Questions