Reputation: 389
I have a large list that contains usernames (about 60,000 strings). Each username represents a submission. Some users have made only one submission i.e. they are "one-time users", so their username appears only once in this list. Others have made multiple submission (returning users) so their username can appear many times in this list. I want to count how many of these one-time users there are and get some stats based on that. Here are the variables I'm currently grabbing:
import time
start_time = time.time()
users = ["UserA", "UserB", "UserC", "UserA", "UserA", "UserA", "UserB", "UserB", "UserD"] # ...just a sample, this goes up to ~60,000 elements
print(f"1: got users list. Time elapsed: {time.time() - start_time}")
one_time_users = [user for user in users if users.count(user) == 1]
print(f"2: got one-time users list. Time elapsed: {time.time() - start_time}")
returning_users = [user for user in users if users.count(user) != 1]
print(f"3: got returning users list. Time elapsed: {time.time() - start_time}")
frequencies = [users.count(user) for user in set(users)]
print(f"4: calculated frequencies list. Time elapsed: {time.time() - start_time}")
sorted_frequencies = sorted(frequencies, reverse=True) # Descending order, largest first
print(f"5: got sorted frequencies list. Time elapsed: {time.time() - start_time}")
top_ten_frequencies_sum = sum(sorted_frequencies[:10])
print(f"6: got top 10 frequencies sum. Time elapsed: {time.time() - start_time}")
top_ten_frequencies_percentage = round(((top_ten_frequencies_sum / len(users)) * 100), 2)
print(f"7: got top 10 frequencies percentage. Time elapsed: {time.time() - start_time}")
average_submissions_per_user = round(len(users) / len(set(users)), 2)
print(f"8: got average submissions per user. Time elapsed: {time.time() - start_time}")
This operation is very slow. Here is my output:
1: got users list. Time elapsed: 0.41695237159729004
2: got one-time users list. Time elapsed: 48.26731848716736
3: got returning users list. Time elapsed: 101.88410639762878
4: calculated frequencies list. Time elapsed: 104.39784860610962
5: got sorted frequencies list. Time elapsed: 104.39850783348083
6: got top 10 frequencies sum. Time elapsed: 104.39853930473328
7: got top 10 frequencies percentage. Time elapsed: 104.39856457710266
8: got average submissions per user. Time elapsed: 104.4005241394043
As you can see the list comprehensions are taking the most time. Can someone explain to me:
Thank you!
Upvotes: 1
Views: 122
Reputation: 4427
As mentioned in your own comment, Counter is significantly faster here. You can see from your own timing that creating a set of the results takes around 10ms to complete (#8->#9), which is roughly the time Counter will take as well.
With counter you look at at each of the N elements once, and then at each unique element (at most N) once.
When you use .count()
you iterate through the entire list (a fast implementation, but still the entire list). You do this for every element, so you look at each of N elements N times.
Every time your list gets 1000x bigger you require 1000x the time for the Counter method, but 1000000x for .count versions.
Upvotes: 1
Reputation: 11929
You can improve by using a Counter, in 2.
for each element you are iterating the whole list, and you are doing this multiple times for the same user if an user occurs more than once.
Note that when you use users.count(user)
you iterate all the list of users to count how many times the user occurs. This means quadratic complexity with respect to the length of the list.
Using a counter, you can solve it in linear complexity.
Also, in 4.
you are iterating and counting again, while you could just remove the ones you just computed from the whole users.
Example.
>>> one_time_users = {user for user,cnt in Counter(users).items() if cnt == 1}
{'UserC', 'UserD'}
>>> returning_users = set(users) - one_time_users
>>> returning_users
{'UserB', 'UserA'}
or more directly
one_time_users, returning_users = [], []
for user,cnt in Counter(users).items():
if cnt==1:
one_time_users.append(user)
else:
returning_users.append(user)
Here a comparison of l.count(el)
vs a Counter(l)
.
>>> l = random.choices(range(500), k=60000)
>>> timeit.timeit('[el for el in l if l.count(el) == 1]',setup='from __main__ import l',number=1)
71.70409394335002
>>> timeit.timeit('[el for el,cnt in Counter(l).items() if cnt == 1]',setup='from __main__ import l, Counter',number=1)
0.005492186639457941
Upvotes: 2