Reputation: 63
I am trying to isolate specific items in a list (e.g. [0, 1, 1]
will return [0, 1]
). I managed to get through this, but I noticed something strange.
When I tried to append to list it ran about 7 times slower then when I was concatenating strings and then splitting it.
This is my code:
import time
start = time.time()
first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]
values = first + second
distinct_string = ""
for i in values:
if not str(i) in distinct_string:
distinct_string += str(i) + " "
print(distinct_string.split())
print(" --- %s sec --- " % (start - time.time()))
This result end in about 5 seconds... Now for the lists:
import time
start = time.time()
first = [x for x in range(99999) if x % 2 == 0]
second = [x for x in range(99999) if x % 4 == 0]
values = first + second
distinct_list = []
for i in values:
if not i in distinct_list:
distinct_list.append(i)
print(distinct_list)
print(" --- %s sec --- " % (start - time.time()))
Runs at around 40 seconds.
What makes string faster even though I am converting a lot of values to strings?
Upvotes: 4
Views: 2196
Reputation: 4209
I think there is a flaw of logic that makes the string method seemingly much faster.
When matching substrings in a long string, the in
operator will return prematurely at the first substring containing the search item. To prove this, I let the loop run backwards from the highest values down to the smallest, and it returned only 50% of the values of the original loop (I checked the length of the result only). If the matching was exact there should be no difference whether you check the sequence from the start or from the end. I conclude that the string method short-cuts a lot of comparisons by matching near the start of the long string. The particular choice of duplicates is unfortunately masking this.
In a second test, I let the string method search for " " + str(i) + " "
to eliminate substring matches. Now it will run only about 2x faster than the list method (but still, faster).
@jonrsharpe: Regarding the set_method I cannot see why one would touch all set elements one by one and not in one set
statement like this:
def set_method(values):
return list(set(values))
This produces exactly the same output and runs about 2.5x faster on my PC.
Upvotes: 1
Reputation: 5658
searching in strings:
if not str(i) in distinct_string:
is much faster
then searching in lists
if not i in distinct_list:
here are lprofile lines for string search in OP code
Line # Hits Time Per Hit % Time Line Contents
17 75000 80366013 1071.5 92.7 if not str(i) in distinct_string:
18 50000 2473212 49.5 2.9 distinct_string += str(i) + " "
and for list search in OP code
39 75000 769795432 10263.9 99.1 if not i in distinct_list:
40 50000 2813804 56.3 0.4 distinct_list.append(i)
Upvotes: 1
Reputation: 121966
Note that it's generally better to use timeit
to compare functions, which runs the same thing multiple times to get average performance, and to factor out repeated code to focus on the performance that matters. Here's my test script:
first = [x for x in range(999) if x % 2 == 0]
second = [x for x in range(999) if x % 4 == 0]
values = first + second
def str_method(values):
distinct_string = ""
for i in values:
if not str(i) in distinct_string:
distinct_string += str(i) + " "
return [int(s) for s in distinct_string.split()]
def list_method(values):
distinct_list = []
for i in values:
if not i in distinct_list:
distinct_list.append(i)
return distinct_list
def set_method(values):
seen = set()
return [val for val in values if val not in seen and seen.add(val) is None]
if __name__ == '__main__':
assert str_method(values) == list_method(values) == set_method(values)
import timeit
funcs = [func.__name__ for func in (str_method, list_method, set_method)]
setup = 'from __main__ import {}, values'.format(', '.join(funcs))
for func in funcs:
print(func)
print(timeit.timeit(
'{}(values)'.format(func),
setup=setup,
number=1000
))
I've added int conversion to make sure that the functions return the same thing, and get the following results:
str_method
1.1685157899992191
list_method
2.6124089090008056
set_method
0.09523714500392089
Note that it is not true that searching in a list is faster than searching in a string if you have to convert the input:
>>> timeit.timeit('1 in l', setup='l = [9, 8, 7, 6, 5, 4, 3, 2, 1]')
0.15300405000016326
>>> timeit.timeit('str(1) in s', setup='s = "9 8 7 6 5 4 3 2 1"')
0.23205067300295923
Repeated append
ing to a list is not very efficient, as it means frequent resizing of the underlying object - the list comprehension, as shown in the set
version, is more efficient.
Upvotes: 4