Reputation: 5927
I have a list of lists of tuples. Each tuple has the form (string, int), e.g.
lst = list()
lst.append([('a',5),('c',10),('d',3),('b',1)])
lst.append([('c',14),('f',4),('b',1)])
lst.append([('d',22),('f',2)])
Think of the ints as counts of each string in different blocks of text. What I need to do is produce a list of the top-N occurring strings together with their cumulative counts. So in the example above, a appears 5 times, b appears twice, c appears 24 times, etc. If N=2, then I would have to produce either a pair of parallel lists ['d', 'c'] and [25, 24], or a list of tuples [('d', 25), ('c', 24)]. I need to do it as quickly as possible. My machine has lots of RAM, so memory is not an issue.
I have this implementation:
import numpy as np

def getTopN(lst, N):
    sOut = []
    cOut = []
    for l in lst:
        for tpl in l:
            s = tpl[0]
            c = tpl[1]
            try:
                i = sOut.index(s)
                cOut[i] += c
            except ValueError:  # s not seen yet
                sOut.append(s)
                cOut.append(c)
    sIndAsc = np.argsort(cOut).tolist()
    sIndDes = sIndAsc[::-1]
    cOutDes = [cOut[sir] for sir in sIndDes]
    sOutDes = [sOut[sir] for sir in sIndDes]
    return sOutDes[0:N], cOutDes[0:N]
There's gotta be a better way, but what would it be?
Upvotes: 1
Views: 148
Reputation: 82929
Use collections.Counter:
import collections

c = collections.Counter()
for x in lst:
    c.update(dict(x))
print(c.most_common(2))
Output:
[('d', 25), ('c', 24)]
The Counter is basically a dictionary with some added functionality, so looking up a value and adding to its current count is really fast. dict(x) will just turn the list of tuples into a regular dict, mapping strings to numbers; then the update method of Counter will add those counts (instead of just overwriting the values, as a regular dict would do).
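The difference between the two update semantics can be seen directly (a minimal illustration, using the c entries from the question's data):

```python
import collections

# a plain dict.update overwrites the existing value
d = {'c': 10}
d.update({'c': 14})
assert d['c'] == 14

# Counter.update adds to the existing count instead
c = collections.Counter({'c': 10})
c.update({'c': 14})
assert c['c'] == 24
```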
Alternatively, a more manual approach using a defaultdict:
c = collections.defaultdict(int)
for x, y in (t for x in lst for t in x):
    c[x] += y
result = [(k, c[k]) for k in sorted(c, key=c.get, reverse=True)][:2]
As pointed out by John in the comments, the defaultdict is indeed much faster:
In [2]: %timeit with_counter()
10000 loops, best of 3: 17.3 µs per loop
In [3]: %timeit with_dict()
100000 loops, best of 3: 4.97 µs per loop
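For large numbers of distinct strings, heapq.nlargest can replace the full sort with a partial selection; this is a sketch of a variant not benchmarked above, reusing the same defaultdict accumulation:

```python
import collections
import heapq

lst = [[('a', 5), ('c', 10), ('d', 3), ('b', 1)],
       [('c', 14), ('f', 4), ('b', 1)],
       [('d', 22), ('f', 2)]]

# accumulate totals exactly as in the defaultdict version
c = collections.defaultdict(int)
for s, n in (t for sub in lst for t in sub):
    c[s] += n

# partial selection: roughly O(M log N) instead of sorting all M keys
top2 = heapq.nlargest(2, c.items(), key=lambda kv: kv[1])
print(top2)  # [('d', 25), ('c', 24)]
```

Whether this beats sorted() in practice depends on how large the dictionary is relative to N; for tiny inputs like this one the difference is negligible.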
Upvotes: 6
Reputation: 2825
Another option, using numpy:
import numpy as np

# make a flattened numpy version of the list
lst_np = np.asarray([item for sublist in lst for item in sublist])
# split into the two columns
chars = lst_np[:, 0]
counts = lst_np[:, 1].astype('int')
# get unique characters, and compute total counts for each
unique_chars, unique_inds = np.unique(chars, return_inverse=True)
unique_counts = np.asarray([np.sum(counts[unique_inds == x])
                            for x in range(len(unique_chars))])
This will get you counts (unique_counts) for each unique character (unique_chars) in the lists, not just the top N. It should be pretty fast, but might be heavy on memory.
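The per-character summing loop can also be fully vectorized with np.bincount, using the counts as weights; this is a sketch of an alternative not in the original answer:

```python
import numpy as np

lst = [[('a', 5), ('c', 10), ('d', 3), ('b', 1)],
       [('c', 14), ('f', 4), ('b', 1)],
       [('d', 22), ('f', 2)]]

flat = [item for sublist in lst for item in sublist]
chars = np.array([s for s, _ in flat])
counts = np.array([n for _, n in flat], dtype=int)

# map each occurrence to the index of its unique character
unique_chars, unique_inds = np.unique(chars, return_inverse=True)
# sum the counts falling into each unique-character bin in one call
unique_counts = np.bincount(unique_inds, weights=counts).astype(int)

# top-N (here N=2) via argsort, as in the original approach
order = np.argsort(unique_counts)[::-1][:2]
print(list(zip(unique_chars[order], unique_counts[order])))
```

This avoids one Python-level loop per unique character, which matters when the number of distinct strings is large.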
Upvotes: 0