jeremy radcliff

Reputation: 1109

(2n log(n) + n) anagram detection function not much slower than 4n + 26 function despite huge n

I have two anagram detection functions: one uses sort-and-compare, and the other keeps track of the number of occurrences of each alphabetical character.

Assume here that the two strings passed to the functions are anagrams of each other: the first randomly generated (unsorted) and the second a shuffled copy of the first, so that both functions execute "all the way" and return True.

According to my algorithmic analysis (obviously probably wrong...), the first function should be 2n log(n) + n, and the second function 2 + 2n + 2n + 26 = 4n + 28.

def anag1(s1, s2):
    # Sort both strings, then compare them character by character.
    s1 = list(s1)
    s1.sort()
    s2 = list(s2)
    s2.sort()

    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False

    return True

def anag2(s1, s2):
    # Count occurrences of each letter; anagrams yield identical counts.
    l1 = [0] * 26
    l2 = [0] * 26
    for i in s1:
        k = ord(i) % 26  # maps 'a'..'z' to 26 distinct buckets
        l1[k] += 1
    for i in s2:
        k = ord(i) % 26
        l2[k] += 1

    return l1 == l2

For two strings of 100,000 characters, this should mean the first function takes about 3.4 million time units versus about 400,000 time units for the second.

In other words, the second function should be roughly 8.5 times faster than the first.

However, when I time the execution of both functions, the second one is not even twice as fast as the first. For two strings of 100,000 characters, the first function executes in 0.085s and the second in 0.055s.
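For reference, a minimal, self-contained timing sketch of the comparison described above (the input setup and iteration count are illustrative, not the asker's exact benchmark):

```python
import random
import string
import timeit

def anag1(s1, s2):
    # Sort both strings, then compare them character by character.
    s1 = list(s1)
    s1.sort()
    s2 = list(s2)
    s2.sort()
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False
    return True

def anag2(s1, s2):
    # Count occurrences of each letter; anagrams yield identical counts.
    l1 = [0] * 26
    l2 = [0] * 26
    for i in s1:
        l1[ord(i) % 26] += 1
    for i in s2:
        l2[ord(i) % 26] += 1
    return l1 == l2

n = 100_000
s1 = ''.join(random.choice(string.ascii_lowercase) for _ in range(n))
s2 = ''.join(random.sample(s1, n))  # a shuffled copy, so both return True

t1 = timeit.timeit(lambda: anag1(s1, s2), number=10)
t2 = timeit.timeit(lambda: anag2(s1, s2), number=10)
print(f"anag1: {t1:.3f}s  anag2: {t2:.3f}s  ratio: {t1 / t2:.2f}")
```

Using `timeit` over several repetitions smooths out one-off noise that a single wall-clock measurement can introduce.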

What's going on?

Upvotes: 1

Views: 73

Answers (2)

Tomas F.

Reputation: 720

Theoretical complexity may have little to do with the actual time spent. Consider:

  • Various complexity of single operations (e.g. division vs addition)
  • Memory access (a bad cache hit ratio due to frequent random access on huge arrays can more than double the run time)
  • Compiler / interpreter optimizations
  • etc.

And not every sort is O(n log(n)):

  • Counting sort, bucket sort, and radix sort are all O(n) or close to it.
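To illustrate the point about O(n) sorts: for a fixed small alphabet like a-z, counting sort runs in O(n + k) time, where k is the alphabet size. A sketch, assuming lowercase-only input (not code from the question):

```python
def counting_sort(s):
    """Sort a lowercase a-z string in O(n + 26) time."""
    counts = [0] * 26
    for ch in s:
        counts[ord(ch) - ord('a')] += 1  # tally each letter
    # Rebuild the string in alphabetical order from the tallies.
    return ''.join(chr(ord('a') + i) * c for i, c in enumerate(counts))

counting_sort("banana")  # -> "aaabnn"
```

This is essentially what anag2 does, minus the rebuilding step, which is why its count-and-compare approach is O(n).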

Edit: I implemented both in Java on an array of 100 million numbers; it took 383 ms vs. 161 ms. At 10 million elements the times were almost equal.

  • Your 100k-element array is way too small to warm up the optimizers.
  • Java uses DualPivotQuicksort, which runs in almost O(n) time on arrays with many duplicate values (such as a small character alphabet). Your language may use something similar.

Upvotes: 3

Warren Dew

Reputation: 8928

Not all primitive operations take the same amount of time. For example, your second method includes divisions (the modulo), which often take more CPU cycles than other operations. While the first function is O(n log(n)) and the second is O(n), you don't know what the constant factors are.
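As an illustration, here is a hypothetical variant of anag2 that replaces the modulo with a subtraction, since `ord(c) - 97` maps 'a'..'z' to 0..25 without a division. (In CPython the interpreter overhead usually dominates, so the measurable gain may be small; the point is only that constant factors differ between operations.)

```python
def anag2_sub(s1, s2):
    # Same counting approach as anag2, but using subtraction
    # instead of the modulo to index the buckets.
    l1 = [0] * 26
    l2 = [0] * 26
    for c in s1:
        l1[ord(c) - 97] += 1  # 97 == ord('a')
    for c in s2:
        l2[ord(c) - 97] += 1
    return l1 == l2
```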

What your analysis really indicates is that if you repeat the test with million-character strings, the gap between the two speeds should widen.

Upvotes: 2
