Samuel Wen
Samuel Wen

Reputation: 41

Which way is better to compare two strings in Python? I want to know which one is faster

Suppose two variables a and b stores two strings, I want to compare whether the two strings are same or not. In Python, the typical way should be a == b. I believe the time complexity of that is O(min(m,n)), where m and n are the length of string. But if I compare them in this way a in {b} (I add string b to a set and check if string a in that set), will the time complexity be O(1), regardless of the length of strings in a and b?

Upvotes: 3

Views: 3951

Answers (3)

Nicolas Forstner
Nicolas Forstner

Reputation: 491

No, as the in operator first computes the hash of the string (linear complexity) and then does another comparison if an element with the same hash is in the set (again, linear complexity). So, a == b is your best bet, as it skips the computation of the hash and avoids the construction of the set.

The constant time complexity of the in operator is with regards to the length of the set. It is still linear in regards to the length of the string.

However, the following results seem to disagree:

$ python3 -m timeit "'aaa' == 'aaa'"
20000000 loops, best of 5: 18.9 nsec per loop
$ python3 -m timeit "'aaa' == 'bbb'"
10000000 loops, best of 5: 22.1 nsec per loop
$ python3 -m timeit "'aaa' in {'aaa'}"
20000000 loops, best of 5: 17 nsec per loop
$ python3 -m timeit "'aaa' in {'bbb'}"
20000000 loops, best of 5: 16 nsec per loop

A slightly altered example seems to make more sense...

$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a in {b}"
5000000 loops, best of 5: 63 nsec per loop
$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a in {a}"
5000000 loops, best of 5: 55.7 nsec per loop
$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a == b"
10000000 loops, best of 5: 30.1 nsec per loop
$ python3 -m timeit "a = 'aaa'; b = 'bbb'; a == a"
10000000 loops, best of 5: 27 nsec per loop

The resaon for this can be found in the disassembly of the code:

from dis import dis

def a():
    'aaa' in {'aaa'}

def b():
    x = 'aaa'
    'aaa' in {x}

print(dis(a))
print(dis(b))

Yields:

  5           0 LOAD_CONST               1 ('aaa')
              2 LOAD_CONST               2 (frozenset({'aaa'}))
              4 COMPARE_OP               6 (in)
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE
None
 12           0 LOAD_CONST               1 ('aaa')
              2 STORE_FAST               0 (x)

 13           4 LOAD_CONST               1 ('aaa')
              6 LOAD_FAST                0 (x)
              8 BUILD_SET                1
             10 COMPARE_OP               6 (in)
             12 POP_TOP
             14 LOAD_CONST               0 (None)
             16 RETURN_VALUE
None

Since we are using only literals for the first approach, python can make one of its (extremely rare) optimisations and use LOAD_CONST instead of having to construct the set during runtime.

For more realistic code however, this is no longer the case and the expensive BUILD_SET has to be used.

Upvotes: 6

Gwang-Jin Kim
Gwang-Jin Kim

Reputation: 9865

Time complexity-wise, it will be faster to compare a is b for strings than a == b, given one can exclude that none of them is NaN.

Okay, I thought this is "smart" but see below a counterexample why using is is fatal - by @khelwood! The article suggested by @JohnColeman is also very interesting. (Thanks!)

Upvotes: -2

Locke
Locke

Reputation: 8964

No. This just hides the check. To check if a value is in a set, it first needs to hash the member it is checking. Then, if it finds a match for the hash it needs to call their implementation of equals to verify if the matching hash was just a false positive. So now you end up with O(m+n+min(m, n)).

That being said, since strings are such a core part of computer science they have a lot of various optimizations behind the scene. Some strings might have already been hashed ahead of time and that value was cached. Or maybe a and b point to the same string so it can exit early. Or their lengths are different so it can quickly exit. Attempting to add your own optimizations for something as important as strings will likely only make it more difficult for the system you are working to help you.

Upvotes: 1

Related Questions