Moritz Wolff

Reputation: 508

Time complexity analysis of two algorithms contradicts empirical results

I wrote the following simple function that checks whether str1 is a permutation of str2:

def is_perm(str1, str2):
    return True if sorted(str1)==sorted(str2) else False
    

Assuming that sorted(str) has a time complexity of O(n log n), we can expect an overall time complexity of O(2 n log n) = O(n log n). The following function is an attempt to achieve a better time complexity:


def is_perm2(str1, str2):
    dict1 = {}
    dict2 = {}
    
    for char in str1:
        if char in dict1:
            dict1[char] += 1 
        else:
            dict1[char] = 1
    for char in str2:
        if char in dict2:
            dict2[char] += 1         
        else:
            dict2[char] = 1
        
    if dict1==dict2:
        return True
    else:
        return False
    

Each for-loop iterates n times. Assuming that dictionary lookups and updates have constant time complexity, I expect an overall complexity of O(2n) = O(n). However, the timeit measurements below contradict this. Why is is_perm2 slower than is_perm over 10,000,000 executions even though its time complexity looks better? Are my assumptions wrong?

import timeit

print(timeit.timeit('is_perm("helloworld","worldhello")', 'from __main__ import is_perm', number=10000000))
print(timeit.timeit('is_perm2("helloworld","worldhello")', 'from __main__ import is_perm2', number=10000000))

# output of first print-call: 12.4199592999993934 seconds
# output of second print-call: 37.13826630001131 seconds

Upvotes: 3

Views: 308

Answers (2)

trincot

Reputation: 349964

There is no guarantee that an algorithm with a time complexity of O(n log n) will be slower than one with a time complexity of O(n) for a given input. The second one could, for instance, have a large constant overhead, making it slower for input sizes below (say) 100,000.

In your test the input size is 10 ("helloworld"), which doesn't tell us much. Repeating that test doesn't make a difference, even if repeated 10000000 times. The repetition only gives a more precise estimate of the average time spent on that particular input.

You would need to feed the algorithm with increasingly large inputs. If memory allows, that would eventually bring us to an input size for which the O(nlogn) algorithm takes more time than the O(n) algorithm.

In this case, I found that the input size had to be really large in comparison with available memory, and I only barely managed to find a case where the difference showed:

import random
import string
import timeit

def shuffled_string(s):
    # Return a copy of s with its characters shuffled into a random order.
    lst = list(s)
    random.shuffle(lst)
    return "".join(lst)

def random_string(size):
    return "".join(random.choices(string.printable, k=size))

str1 = random_string(10000000)
str2 = shuffled_string(str1)
print("start")
print(timeit.timeit(lambda: is_perm(str1, str2), number=5))
print(timeit.timeit(lambda: is_perm2(str1, str2), number=5))

After the initial setup of the strings (which each have a size of 10 million characters), I got this output:

54.72847577700304
51.07616817899543

The reason why the input has to be so large to see this happen is that sorted does all the hard work in lower-level, compiled code (C, in the case of CPython), while the second solution does all the looping and character reading in interpreted Python code. The per-character overhead of the second solution is huge in comparison with the first.
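To see where the crossover happens on a particular machine, one could time both functions over a range of input sizes. Below is a minimal sketch of such a sweep; it assumes is_perm and is_perm2 are defined as in the question, and the sizes and repeat count are arbitrary choices:

import random
import string
import timeit

# is_perm and is_perm2 are assumed to be the two functions from the question.

def random_string(size):
    # Build a random string of the requested size.
    return "".join(random.choices(string.printable, k=size))

for size in (10_000, 100_000, 1_000_000, 10_000_000):
    str1 = random_string(size)
    str2 = "".join(random.sample(str1, len(str1)))  # a shuffled copy of str1
    t_sort = timeit.timeit(lambda: is_perm(str1, str2), number=3)
    t_dict = timeit.timeit(lambda: is_perm2(str1, str2), number=3)
    print(f"n={size:>10,}  sorted: {t_sort:.2f}s  dicts: {t_dict:.2f}s")

The exact crossover point (if it is reached at all) depends on the machine, the Python build, and available memory.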

Improving the second solution

Although not your question, we could improve the implementation of the second algorithm by relying on the standard library's collections.Counter:

from collections import Counter

def is_perm3(str1, str2):
    return Counter(str1) == Counter(str2)

With the same test setup as above, I got this timing for this implementation:

24.917681352002546

Upvotes: 2

Code Plus

Reputation: 170

Assuming that dictionary lookup and both dictionary updates have constant time complexity,

A Python dictionary is a hash map, so strictly speaking a dictionary lookup or update costs O(n) in the worst case. The average time complexity of is_perm2 is O(n), but its worst-case time complexity is O(n^2).

If you want a guaranteed O(n) time complexity, use a list (not a dictionary) to store the frequency of each character. You can convert each character to its ASCII code and use that number as an index into a Python list, as sketched below.
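A minimal sketch of that idea (the name is_perm_list is my own; it assumes the inputs only contain characters with code points below 256, e.g. ASCII text):

def is_perm_list(str1, str2):
    if len(str1) != len(str2):
        return False
    counts = [0] * 256  # one counter per possible 8-bit character code
    for char in str1:
        counts[ord(char)] += 1
    for char in str2:
        counts[ord(char)] -= 1
    # The strings are permutations of each other iff every counter is back to zero.
    return all(count == 0 for count in counts)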

Upvotes: 0
