pystudent
pystudent

Reputation: 571

Python string comparison performance is inconsistent

I wanted to check how string comparison works (I wanted to see if it is char by char, and if it checks the length of the string before the comparison), so I've used this code:

s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = time.clock()
s1==s2
print time.clock() - t1
t2 = time.clock()
s1==s3
print time.clock() - t2
t3 = time.clock()
s1==s4
print time.clock() - t3

When I've tried the same thing on very long strings (~30MB text files) it worked great and I found out that it does perform a length check, and also it compares char by char. But when I've tried it on short string (such as the string in the code above) the performance results were very inconsistent. Any one has any idea why they were inconsistent or what have I done wrong? (perhaps I was wrong and the comparison doesn't work as I thought?)

Edit: An example for something I've also tried is to compare different lengths of string with a specific string. I thought that the one that takes the longest to perform will be the string with the exact length of the other one because the rest will fall in the length check, but it was inconsistent as well). lets say that the string I'm checking is 'hello', so I've compared 'a', 'aa', 'aaa', and so on... I was expecting to see that the longest check will be 'aaaaa' but it was 'a' and I have no idea why.

Upvotes: 0

Views: 315

Answers (2)

Kevin
Kevin

Reputation: 76254

You are correct that strings compare lengths before comparing contents (at least in 2.7). Here is the relevant portion of string_richcompare:

if (op == Py_EQ) {
    /* Supporting Py_NE here as well does not save
       much time, since Py_NE is rarely used.  */
    if (Py_SIZE(a) == Py_SIZE(b)
        && (a->ob_sval[0] == b->ob_sval[0]
        && memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0)) {
        result = Py_True;
    } else {
        result = Py_False;
    }
    goto out;
}

In simple terms, the checks appear to be, in order:

  • if the strings have the same memory address, they are equal. (not pictured in the above code)
  • if the strings have a different size, they are not equal.
  • if the strings have a different first character, they are not equal.
  • if the strings have identical character arrays, they are equal.

The third check doesn't appear to be strictly necessary, but is probably an optimization if manually checking array contents is faster than calling memcmp.


If your benchmarking suggested to you that comparing strings of different length is slower than comparing strings of the same length, this is probably a false alarm caused by the not entirely dependable behavior of clock, as covered in other answers and comments.

Upvotes: 2

unutbu
unutbu

Reputation: 880767

You are liable to get inconsistent results when measuring very small times. You'll get better result by repeating the operation a great many times so that the difference is substantial:

t1 = time.clock()
for i in range(10**6):
    s1 == s2
t2 = time.clock()

Better yet, use the timeit module to handle the repetition (and other details like turning off garbage collection) for you:

import timeit

s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = timeit.timeit('s1==s2', 'from __main__ import s1, s2', number=10**8)
t2 = timeit.timeit('s1==s3', 'from __main__ import s1, s3', number=10**8)
t3 = timeit.timeit('s1==s4', 'from __main__ import s1, s4', number=10**8)
for t in (t1, t2, t3):
    print(t)

yields

2.82305312157
2.83096408844
3.15551590919

Thus s1==s2 and s1==s3 take essentially the same amount of time. s1==s4 requires a bit more time because more characters have to be compared before the equality can return False.


By the way, while time.clock is used by timeit.default_timer for measuring time on Windows, time.time is used by timeit.default_timer for measuring time on Unix. Use timeit.default_timer instead of time.clock or time.time to make your code more cross-platform compatible.

Upvotes: 2

Related Questions