Reputation: 137
I have a list of more than 100 million key-value tuples, like this:
list_a = [(1,'a'), (2,'b'), (1,'a'), (3,'b'), (3,'b'), (1,'a')]
I need to output a second list like this:
list_b = [(1, 'a', 3), (2, 'b', 1), (3, 'b', 2)]
The last element of each tuple is the number of times that tuple appears in the list. The order of list_b doesn't matter.
So I wrote this code:
import collections

list_b = []
for e, c in collections.Counter(list_a).most_common():
    # build the (key, value, count) tuple for each distinct element
    list_b.append((e[0], e[1], c))
Running it with 1,000 tuples takes about 2 seconds... imagine how long it will take with more than 100 million. Any ideas to speed it up?
Upvotes: 3
Views: 639
Reputation: 13858
Your bottleneck is the explicit loop calling the list.append
method: that loop body runs in interpreted Python rather than in C, so it performs much more slowly.
You can use a list comprehension instead, and it'll be much faster:
from collections import Counter

c = Counter(list_a)
# unpack each (key, value) tuple and append its count
result = [(*k, v) for k, v in c.items()]
I ran this on a 1,000-item list on my machine and it was pretty quick.
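If you want to check the difference on your own machine, a minimal timeit sketch like the one below works; the data shape and sizes here are made up for illustration, and the Counter is built once outside the timed functions so only the loop-vs-comprehension step is measured:

from collections import Counter
import timeit

# illustrative data: 1,000,000 tuples with 500,000 distinct keys (sizes are arbitrary)
list_a = [(i % 500_000, 'a') for i in range(1_000_000)]
c = Counter(list_a)

def with_append():
    # explicit loop + append, as in the question
    out = []
    for k, v in c.items():
        out.append((*k, v))
    return out

def with_comprehension():
    # the same work expressed as a list comprehension
    return [(*k, v) for k, v in c.items()]

print('append loop:  ', timeit.timeit(with_append, number=10))
print('comprehension:', timeit.timeit(with_comprehension, number=10))

Either way, note that building the Counter itself dominates the total runtime on a list this large; the comprehension only speeds up the final conversion step.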
Upvotes: 1