Sidewinder
Sidewinder

Reputation: 389

Counting elements in a list that only occur once: how do I optimise my performance? It's very slow

I have a large list that contains usernames (about 60,000 strings). Each username represents a submission. Some users have made only one submission i.e. they are "one-time users", so their username appears only once in this list. Others have made multiple submission (returning users) so their username can appear many times in this list. I want to count how many of these one-time users there are and get some stats based on that. Here are the variables I'm currently grabbing:

import time

start_time = time.time()

users = ["UserA", "UserB", "UserC", "UserA", "UserA", "UserA", "UserB", "UserB", "UserD"] # ...just a sample, this goes up to ~60,000 elements
print(f"1: got users list. Time elapsed: {time.time() - start_time}")

one_time_users = [user for user in users if users.count(user) == 1]
print(f"2: got one-time users list. Time elapsed: {time.time() - start_time}")

returning_users = [user for user in users if users.count(user) != 1]
print(f"3: got returning users list. Time elapsed: {time.time() - start_time}")

frequencies = [users.count(user) for user in set(users)]
print(f"4: calculated frequencies list. Time elapsed: {time.time() - start_time}")

sorted_frequencies = sorted(frequencies, reverse=True) # Descending order, largest first
print(f"5: got sorted frequencies list. Time elapsed: {time.time() - start_time}")

top_ten_frequencies_sum = sum(sorted_frequencies[:10])
print(f"6: got top 10 frequencies sum. Time elapsed: {time.time() - start_time}")

top_ten_frequencies_percentage = round(((top_ten_frequencies_sum / len(users)) * 100), 2)
print(f"7: got top 10 frequencies percentage. Time elapsed: {time.time() - start_time}")

average_submissions_per_user = round(len(users) / len(set(users)), 2)
print(f"8: got average submissions per user. Time elapsed: {time.time() - start_time}")

This operation is very slow. Here is my output:

1: got users list. Time elapsed: 0.41695237159729004
2: got one-time users list. Time elapsed: 48.26731848716736
3: got returning users list. Time elapsed: 101.88410639762878
4: calculated frequencies list. Time elapsed: 104.39784860610962
5: got sorted frequencies list. Time elapsed: 104.39850783348083
6: got top 10 frequencies sum. Time elapsed: 104.39853930473328
7: got top 10 frequencies percentage. Time elapsed: 104.39856457710266
8: got average submissions per user. Time elapsed: 104.4005241394043

As you can see the list comprehensions are taking the most time. Can someone explain to me:

  1. Why it's so slow in terms of time complexity.
  2. Whether collections.Counter() will be a better choice and how best to apply it here.

Thank you!

Upvotes: 1

Views: 122

Answers (2)

Cireo
Cireo

Reputation: 4427

As mentioned in your own comment, Counter is significantly faster here. You can see from your own timing that creating a set of the results takes around 10ms to complete (#8->#9), which is roughly the time Counter will take as well.

With counter you look at at each of the N elements once, and then at each unique element (at most N) once.

When you use .count() you iterate through the entire list (a fast implementation, but still the entire list). You do this for every element, so you look at each of N elements N times.

Every time your list gets 1000x bigger you require 1000x the time for the Counter method, but 1000000x for .count versions.

Upvotes: 1

abc
abc

Reputation: 11929

You can improve by using a Counter, in 2. for each element you are iterating the whole list, and you are doing this multiple times for the same user if an user occurs more than once.
Note that when you use users.count(user) you iterate all the list of users to count how many times the user occurs. This means quadratic complexity with respect to the length of the list.
Using a counter, you can solve it in linear complexity.
Also, in 4. you are iterating and counting again, while you could just remove the ones you just computed from the whole users.
Example.

>>> one_time_users = {user for user,cnt in Counter(users).items() if cnt == 1}
{'UserC', 'UserD'}
>>> returning_users = set(users) - one_time_users
>>> returning_users
{'UserB', 'UserA'}

or more directly

one_time_users, returning_users  = [], []
for user,cnt in Counter(users).items():
   if cnt==1:
      one_time_users.append(user)
   else:
      returning_users.append(user)

Here a comparison of l.count(el) vs a Counter(l).

>>> l = random.choices(range(500), k=60000)
>>> timeit.timeit('[el for el in l if l.count(el) == 1]',setup='from __main__ import l',number=1)
71.70409394335002
>>> timeit.timeit('[el for el,cnt in Counter(l).items() if cnt == 1]',setup='from __main__ import l, Counter',number=1)
0.005492186639457941

Upvotes: 2

Related Questions