Reputation: 17015
Is there any "faster way" to remove key, value pairs from Counter where value is less than certain value?
I've done the following:
counter_dict = {k:v for k, v in counter_dict.items() if v > 5}
Upvotes: 4
Views: 6916
Reputation: 76857
The major issue with the current code is the call to .items
, which will create a list of all items:
One optimization could be to use Counter.iteritems
instead of .items
, to save the penalty of creating a list and iterating through it again.
>>> from collections import Counter
>>> cnt = Counter("asbdasdbasdbadaasasdasadsa")
>>> {k:v for k,v in cnt.iteritems() if v > 5}
{'a': 10, 's': 7, 'd': 6}
Another optimization could be to not call the .items
method, and instead iterate on the keys and access the values using the key:
>>> from collections import Counter
>>> cnt = Counter("asbdasdbasdbadaasasdasadsa")
>>> {k:cnt[k] for k in cnt if cnt[k] > 5}
{'a': 10, 's': 7, 'd': 6}
If we try to measure the difference with %timeit
in ipython, using a sample Counter with your mentioned if condition, iteritems
wins hands down:
In [1]: import random
In [2]: from collections import Counter
In [3]: MILLION = 10**6
In [4]: cnt = Counter(random.randint(0, MILLION) for _ in xrange(MILLION))
In [5]: %timeit {k:v for k, v in cnt.iteritems() if v < 5}
10 loops, best of 3: 140 ms per loop
In [6]: %timeit {k:v for k, v in cnt.items() if v**2 < 5}
1 loops, best of 3: 290 ms per loop
In [7]: %timeit {k:cnt[k] for k in cnt if cnt[k] < 5}
1 loops, best of 3: 272 ms per loop
With change of conditions:
In [8]: %timeit {k:v for k, v in cnt.iteritems() if v > 5}
10 loops, best of 3: 87 ms per loop
In [9]: %timeit {k:v for k, v in cnt.items() if v > 5}
1 loops, best of 3: 186 ms per loop
In [10]: %timeit {k:cnt[k] for k in cnt if cnt[k] > 5}
10 loops, best of 3: 153 ms per loop
Upvotes: 4
Reputation: 110218
So you are probably better not recreating the whole dictionary everytime:
to_remove = set()
for key, value in counter_dict.viewitems():
if value <= 5:
to_remove.add(key)
for key in to_remove:
del counter_dict[key]
Unfolding a "for" statement into more lines not necessarily means less perfomance. Although maybe there is not much performance gain in this case, the memory consumption at least should go way down.
Another option is to make your "counter_dict" a smarter object that knows not to yield its values when the count is <= 5 - that would make this step "lazy".
something along (but not just - the correct thing is to implement this using ABC metaclasses - collections.MutableMapping
class MyDict(dict):
def __init__(*args, **kw):
self.threshold = None
super(MyDict,self).__init__(*args, **kw)
def __getitem__(self, key):
value = super(MyDict, self).__getitem__(key)
if self.threshold is None or key > self.threshold:
return value
raise ItemError
# the same for __contains__ and other interesting methods
And you change the "threshold" attribute object in your dict when it should start filtering. This is more or less overdoing it - since your checking will still be done, just with diluted time - but maybe when consuming objects you are on a async/multithread workload that could make it in parallel - but if you need different thresholds in different parts of the code, it could be a nice thing to have.
Upvotes: 0