Reputation: 571
I wanted to check how string comparison works (I wanted to see if it is char by char, and if it checks the length of the string before the comparison), so I've used this code:
s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = time.clock()
s1==s2
print time.clock() - t1
t2 = time.clock()
s1==s3
print time.clock() - t2
t3 = time.clock()
s1==s4
print time.clock() - t3
When I've tried the same thing on very long strings (~30MB text files) it worked great and I found out that it does perform a length check, and also it compares char by char. But when I've tried it on short string (such as the string in the code above) the performance results were very inconsistent. Any one has any idea why they were inconsistent or what have I done wrong? (perhaps I was wrong and the comparison doesn't work as I thought?)
Edit: An example for something I've also tried is to compare different lengths of string with a specific string. I thought that the one that takes the longest to perform will be the string with the exact length of the other one because the rest will fall in the length check, but it was inconsistent as well). lets say that the string I'm checking is 'hello', so I've compared 'a', 'aa', 'aaa', and so on... I was expecting to see that the longest check will be 'aaaaa' but it was 'a' and I have no idea why.
Upvotes: 0
Views: 315
Reputation: 76254
You are correct that strings compare lengths before comparing contents (at least in 2.7). Here is the relevant portion of string_richcompare
:
if (op == Py_EQ) {
/* Supporting Py_NE here as well does not save
much time, since Py_NE is rarely used. */
if (Py_SIZE(a) == Py_SIZE(b)
&& (a->ob_sval[0] == b->ob_sval[0]
&& memcmp(a->ob_sval, b->ob_sval, Py_SIZE(a)) == 0)) {
result = Py_True;
} else {
result = Py_False;
}
goto out;
}
In simple terms, the checks appear to be, in order:
The third check doesn't appear to be strictly necessary, but is probably an optimization if manually checking array contents is faster than calling memcmp
.
If your benchmarking suggested to you that comparing strings of different length is slower than comparing strings of the same length, this is probably a false alarm caused by the not entirely dependable behavior of clock
, as covered in other answers and comments.
Upvotes: 2
Reputation: 880767
You are liable to get inconsistent results when measuring very small times. You'll get better result by repeating the operation a great many times so that the difference is substantial:
t1 = time.clock()
for i in range(10**6):
s1 == s2
t2 = time.clock()
Better yet, use the timeit module to handle the repetition (and other details like turning off garbage collection) for you:
import timeit
s1 = 'abc'
s2 = 'abcd'
s3 = 'dbc'
s4 = 'abd'
t1 = timeit.timeit('s1==s2', 'from __main__ import s1, s2', number=10**8)
t2 = timeit.timeit('s1==s3', 'from __main__ import s1, s3', number=10**8)
t3 = timeit.timeit('s1==s4', 'from __main__ import s1, s4', number=10**8)
for t in (t1, t2, t3):
print(t)
yields
2.82305312157
2.83096408844
3.15551590919
Thus s1==s2
and s1==s3
take essentially the same amount of time. s1==s4
requires a bit more time because more characters have to be compared before the equality can return False.
By the way, while time.clock
is used by timeit.default_timer
for measuring
time on Windows, time.time
is used by timeit.default_timer
for measuring
time on Unix. Use timeit.default_timer
instead of time.clock
or time.time
to make your code more cross-platform compatible.
Upvotes: 2