Christophe
Christophe

Reputation: 507

how to optimally count elements in a python list

This is almost the same question than here, except that I am asking about the most efficient solution for a sorted result.

I have a list (about 10 integers randomly between 0 and 12), for example:

the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]

I want to create a function that returns a list of tuples (item, counts) ordered by the first element, for example

output = [(4, 3), (5, 4), (6, 1), (7, 2)]

So far I have used:

def dupli(the_list):
    return [(item, the_list.count(item)) for item in sorted(set(the_list))]

But I call this function almost a millon time and I need to make it as fast as I (python) can. Therefore my question: How to make this function less time comsuming? (what about memory?)

I have played around a bit, but nothing obvious came up:

from timeit import Timer as T
number=10000
setup = "the_list=[5, 7, 6, 5, 5, 4, 4, 7, 5, 4]"

stmt = "[(item, the_list.count(item)) for item in sorted(set(the_list))]"
T(stmt=stmt, setup=setup).timeit(number=number)

Out[230]: 0.058799982070922852

stmt = "L = []; \nfor item in sorted(set(the_list)): \n    L.append((item, the_list.count(item)))"
T(stmt=stmt, setup=setup).timeit(number=number)

Out[233]: 0.065041065216064453

stmt = "[(item, the_list.count(item)) for item in set(sorted(the_list))]"
T(stmt=stmt, setup=setup).timeit(number=number)

Out[236]: 0.098351955413818359

Thanks
Christophe

Upvotes: 6

Views: 11380

Answers (6)

PaulMcG
PaulMcG

Reputation: 63802

itertools.groupby is perfect for this:

>>> from itertools import groupby
>>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]
>>> gb = groupby(sorted(the_list))
>>> print [(i,len(list(j))) for i,j in gb]
[(4, 3), (5, 4), (6, 1), (7, 2)]

Upvotes: 0

Steven Rumbalski
Steven Rumbalski

Reputation: 45562

Change where you sort for a savings of about 20%.

Change this:

def dupli(the_list):
    return [(item, the_list.count(item)) for item in sorted(set(the_list))]

To this:

def dupli(the_list):
    count = the_list.count # this optimization added courtesy of Sven's comment
    result = [(item, count(item)) for item in set(the_list)]
    result.sort()
    return result

The reason this is faster is that the sorted iterator must create a temporary list, whereas sorting the result sorts in place.

edit: Here's another approach that is 35% faster than your original:

def dupli(the_list):
    counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    for n in the_list:
        counts[n] += 1
    return [(i, counts[i]) for i in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) if counts[i]]

Note: You may want to randomize the values for the_list. My final version of dupli tests even faster with other random data sets (import random; the_list=[random.randint(0,12) for i in xrange(10)])

Upvotes: 4

Wolph
Wolph

Reputation: 80111

This seems fairly optimal in terms of space and speed:

def dupli2(list_):                                    
    dict_ = {}                                       
    for item in list_:                               
        dict_[item] = dict_.get(item, 0) + 1         
    return sorted(dict_.items())                    

Or this:

def dupli3(list_):                                            
    last = None                                               
    list_ = sorted(list_)                                     

    i = 0                                                     
    for item in list_:                                        
        if item != last and last is not None:                 
            yield last, i                                     
            i = 0                                             
        i += 1                                                
        last = item                                           

    yield last, i 

Not sure about the speed though. For that I'd recommend that you either do it in C or use Psyco ;)

With Psyco:

In [33]: %timeit list(dupli3(test.the_list))
100000 loops, best of 3: 6.46 us per loop

In [34]: %timeit list(dupli2(test.the_list))
100000 loops, best of 3: 2.37 us per loop

In [35]: %timeit list(dupli(test.the_list))
100000 loops, best of 3: 2.7 us per loop

Upvotes: 0

John Machin
John Machin

Reputation: 83032

Taking advantage of the qualification "between 0 and 12":

>>> the_list = [5, 7, 6, 5, 5, 4, 4, 7, 5, 4]
>>> answer1 = [0] * 13
>>> for i in the_list:
...    answer1[i] += 1
...
>>> answer1
[0, 0, 0, 0, 3, 4, 1, 2, 0, 0, 0, 0, 0]
>>> # You might be able to use that as-is:
...
>>> for i, v in enumerate(answer1):
...     if v: print i, v
...
4 3
5 4
6 1
7 2
>>> # Otherwise you can build the list that you specified:
...
>>> answer2 = [(i, v) for i, v in enumerate(answer1) if v]
>>> answer2
[(4, 3), (5, 4), (6, 1), (7, 2)]
>>>

Upvotes: 2

Colin
Colin

Reputation: 10820

It might be faster to write your own function that counts the numbers in one pass through the list. You're calling the count function for every number in the set, and each of those calls requires a pass through the list.

counts = {}
for n in the_list:
    if n not in counts:
        counts[n] = 0
    counts[n] += 1
sorted(counts.items())

Upvotes: 0

Karl Knechtel
Karl Knechtel

Reputation: 61654

I would try:

from collections import defaultdict
output = defaultdict(lambda: 0)
for item in the_list: output[item] += 1
return sorted(output.items())

Upvotes: 3

Related Questions