Reputation: 5927
I have a list of lists of tuples. Each tuple has the form (string, int), e.g.
lst = list()
lst.append([('a',5),('c',10),('d',3),('b',1)])
lst.append([('c',14),('f',4),('b',1)])
lst.append([('d',22),('f',2)])
Think of the ints as counts of each string in different blocks of text. What I need to do is produce a list of the top-N occurring strings together with their cumulative counts. So in the example above, a appears 5 times, b appears twice, c appears 24 times, etc. If N=2, then I would have to produce either a pair of parallel lists ['d', 'c'] and [25, 24], or a list of tuples [('d', 25), ('c', 24)]. I need to do it as quickly as possible. My machine has lots of RAM, so memory is not an issue.
I have this implementation:
import numpy as np

def getTopN(lst, N):
    sOut = []
    cOut = []
    for l in lst:
        for tpl in l:
            s = tpl[0]
            c = tpl[1]
            try:
                i = sOut.index(s)
                cOut[i] += c
            except ValueError:  # s not seen yet
                sOut.append(s)
                cOut.append(c)
    sIndAsc = np.argsort(cOut).tolist()
    sIndDes = sIndAsc[::-1]
    cOutDes = [cOut[sir] for sir in sIndDes]
    sOutDes = [sOut[sir] for sir in sIndDes]
    return sOutDes[0:N], cOutDes[0:N]
There's gotta be a better way, but what would it be?
Upvotes: 1
Views: 148
Reputation: 82929
Use collections.Counter:
import collections

c = collections.Counter()
for x in lst:
    c.update(dict(x))
print(c.most_common(2))
Output:
[('d', 25), ('c', 24)]
The Counter is basically a dictionary with some added functionality, so looking up a value and adding to its current count is really fast. dict(x) will just turn the list of tuples into a regular dict, mapping strings to numbers; then the update method of Counter will add those counts (instead of just overwriting the values, as a regular dict would do).
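The difference between the two update semantics can be seen directly (a minimal illustration, using the c entries from the question's data):

```python
import collections

# a plain dict.update overwrites the existing value
d = {'c': 10}
d.update({'c': 14})
assert d['c'] == 14

# Counter.update adds to the existing count instead
c = collections.Counter({'c': 10})
c.update({'c': 14})
assert c['c'] == 24
```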
Alternatively, a more manual approach using a defaultdict:
c = collections.defaultdict(int)
for x, y in (t for x in lst for t in x):
    c[x] += y
result = [(k, c[k]) for k in sorted(c, key=c.get, reverse=True)][:2]
As pointed out by John in the comments, the defaultdict is indeed much faster:
In [2]: %timeit with_counter()
10000 loops, best of 3: 17.3 µs per loop
In [3]: %timeit with_dict()
100000 loops, best of 3: 4.97 µs per loop
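For large numbers of distinct strings, heapq.nlargest can replace the full sort with a partial selection; this is a sketch of a variant not benchmarked above, reusing the same defaultdict accumulation:

```python
import collections
import heapq

lst = [[('a', 5), ('c', 10), ('d', 3), ('b', 1)],
       [('c', 14), ('f', 4), ('b', 1)],
       [('d', 22), ('f', 2)]]

# accumulate totals exactly as in the defaultdict version
c = collections.defaultdict(int)
for s, n in (t for sub in lst for t in sub):
    c[s] += n

# partial selection: roughly O(M log N) instead of sorting all M keys
top2 = heapq.nlargest(2, c.items(), key=lambda kv: kv[1])
print(top2)  # [('d', 25), ('c', 24)]
```

Whether this beats sorted() in practice depends on how large the dictionary is relative to N; for tiny inputs like this one the difference is negligible.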
Upvotes: 6
Reputation: 2825
Another option, using numpy:
import numpy as np

# make a flattened numpy version of the list
lst_np = np.asarray([item for sublist in lst for item in sublist])
# split into the two columns
chars = lst_np[:, 0]
counts = lst_np[:, 1].astype('int')
# get unique characters, and compute total counts for each
unique_chars, unique_inds = np.unique(chars, return_inverse=True)
unique_counts = np.asarray([np.sum(counts[unique_inds == x])
                            for x in range(len(unique_chars))])
This will get you counts (unique_counts) for each unique character (unique_chars) in the lists, not just the top N. It should be pretty fast, but might be heavy on memory.
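The per-character summing loop can also be fully vectorized with np.bincount, using the counts as weights; this is a sketch of an alternative not in the original answer:

```python
import numpy as np

lst = [[('a', 5), ('c', 10), ('d', 3), ('b', 1)],
       [('c', 14), ('f', 4), ('b', 1)],
       [('d', 22), ('f', 2)]]

flat = [item for sublist in lst for item in sublist]
chars = np.array([s for s, _ in flat])
counts = np.array([n for _, n in flat], dtype=int)

# map each occurrence to the index of its unique character
unique_chars, unique_inds = np.unique(chars, return_inverse=True)
# sum the counts falling into each unique-character bin in one call
unique_counts = np.bincount(unique_inds, weights=counts).astype(int)

# top-N (here N=2) via argsort, as in the original approach
order = np.argsort(unique_counts)[::-1][:2]
print(list(zip(unique_chars[order], unique_counts[order])))
```

This avoids one Python-level loop per unique character, which matters when the number of distinct strings is large.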
Upvotes: 0