Counting elements in a list that only occur once: how do I optimise my performance? It's very slow

Question

I have a large list that contains usernames (about 60,000 strings). Each username represents a submission. Some users have made only one submission i.e. they are "one-time users", so their username appears only once in this list. Others have made multiple submission (returning users) so their username can appear many times in this list. I want to count how many of these one-time users there are and get some stats based on that. Here are the variables I'm currently grabbing:

import time

start_time = time.time()

users = ["UserA", "UserB", "UserC", "UserA", "UserA", "UserA", "UserB", "UserB", "UserD"] # ...just a sample, this goes up to ~60,000 elements
print(f"1: got users list. Time elapsed: {time.time() - start_time}")

one_time_users = [user for user in users if users.count(user) == 1]
print(f"2: got one-time users list. Time elapsed: {time.time() - start_time}")

returning_users = [user for user in users if users.count(user) != 1]
print(f"3: got returning users list. Time elapsed: {time.time() - start_time}")

frequencies = [users.count(user) for user in set(users)]
print(f"4: calculated frequencies list. Time elapsed: {time.time() - start_time}")

sorted_frequencies = sorted(frequencies, reverse=True) # Descending order, largest first
print(f"5: got sorted frequencies list. Time elapsed: {time.time() - start_time}")

top_ten_frequencies_sum = sum(sorted_frequencies[:10])
print(f"6: got top 10 frequencies sum. Time elapsed: {time.time() - start_time}")

top_ten_frequencies_percentage = round(((top_ten_frequencies_sum / len(users)) * 100), 2)
print(f"7: got top 10 frequencies percentage. Time elapsed: {time.time() - start_time}")

average_submissions_per_user = round(len(users) / len(set(users)), 2)
print(f"8: got average submissions per user. Time elapsed: {time.time() - start_time}")

This operation is very slow. Here is my output:

1: got users list. Time elapsed: 0.41695237159729004
2: got one-time users list. Time elapsed: 48.26731848716736
3: got returning users list. Time elapsed: 101.88410639762878
4: calculated frequencies list. Time elapsed: 104.39784860610962
5: got sorted frequencies list. Time elapsed: 104.39850783348083
6: got top 10 frequencies sum. Time elapsed: 104.39853930473328
7: got top 10 frequencies percentage. Time elapsed: 104.39856457710266
8: got average submissions per user. Time elapsed: 104.4005241394043

As you can see the list comprehensions are taking the most time. Can someone explain to me:

Why it's so slow in terms of time complexity.
Whether collections.Counter() will be a better choice and how best to apply it here.

Thank you!

abc · Accepted Answer

You can improve by using a Counter, in 2. for each element you are iterating the whole list, and you are doing this multiple times for the same user if an user occurs more than once.
Note that when you use users.count(user) you iterate all the list of users to count how many times the user occurs. This means quadratic complexity with respect to the length of the list.
Using a counter, you can solve it in linear complexity.
Also, in 4. you are iterating and counting again, while you could just remove the ones you just computed from the whole users.
Example.

>>> one_time_users = {user for user,cnt in Counter(users).items() if cnt == 1}
{'UserC', 'UserD'}
>>> returning_users = set(users) - one_time_users
>>> returning_users
{'UserB', 'UserA'}

or more directly

one_time_users, returning_users  = [], []
for user,cnt in Counter(users).items():
   if cnt==1:
      one_time_users.append(user)
   else:
      returning_users.append(user)

Here a comparison of l.count(el) vs a Counter(l).

>>> l = random.choices(range(500), k=60000)
>>> timeit.timeit('[el for el in l if l.count(el) == 1]',setup='from __main__ import l',number=1)
71.70409394335002
>>> timeit.timeit('[el for el,cnt in Counter(l).items() if cnt == 1]',setup='from __main__ import l, Counter',number=1)
0.005492186639457941

Counting elements in a list that only occur once: how do I optimise my performance? It's very slow

Answers (2)

Related Questions

Counting elements in a list that only occur once: how do I optimise my performance? It&#39;s very slow

Answers (2)

Related Questions

Counting elements in a list that only occur once: how do I optimise my performance? It's very slow