Python string comparison performance is inconsistent

Question

I wanted to check how string comparison works (I wanted to see if it is char by char, and if it checks the length of the string before the comparison), so I've used this code:

s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = time.clock()
s1==s2
print time.clock() - t1
t2 = time.clock()
s1==s3
print time.clock() - t2
t3 = time.clock()
s1==s4
print time.clock() - t3

When I've tried the same thing on very long strings (~30MB text files) it worked great and I found out that it does perform a length check, and also it compares char by char. But when I've tried it on short string (such as the string in the code above) the performance results were very inconsistent. Any one has any idea why they were inconsistent or what have I done wrong? (perhaps I was wrong and the comparison doesn't work as I thought?)

Edit: An example for something I've also tried is to compare different lengths of string with a specific string. I thought that the one that takes the longest to perform will be the string with the exact length of the other one because the rest will fall in the length check, but it was inconsistent as well). lets say that the string I'm checking is 'hello', so I've compared 'a', 'aa', 'aaa', and so on... I was expecting to see that the longest check will be 'aaaaa' but it was 'a' and I have no idea why.

Kevin · Accepted Answer

You are correct that strings compare lengths before comparing contents (at least in 2.7). Here is the relevant portion of string_richcompare:

if (op == Py_EQ) {
    /* Supporting Py_NE here as well does not save
       much time, since Py_NE is rarely used.  */
    if (Py_SIZE(a) == Py_SIZE(b)
        && (a->ob_sval[0] == b->ob_sval[0]
        && memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0)) {
        result = Py_True;
    } else {
        result = Py_False;
    }
    goto out;
}

In simple terms, the checks appear to be, in order:

if the strings have the same memory address, they are equal. (not pictured in the above code)
if the strings have a different size, they are not equal.
if the strings have a different first character, they are not equal.
if the strings have identical character arrays, they are equal.

The third check doesn't appear to be strictly necessary, but is probably an optimization if manually checking array contents is faster than calling memcmp.

If your benchmarking suggested to you that comparing strings of different length is slower than comparing strings of the same length, this is probably a false alarm caused by the not entirely dependable behavior of clock, as covered in other answers and comments.

Python string comparison performance is inconsistent

Answers (2)

Related Questions