Abhishek Thakur

Reputation: 17015

removing least common elements from Counter

Is there any "faster way" to remove key/value pairs from a Counter when the value is less than a certain threshold?

I've done the following:

counter_dict = {k:v for k, v in counter_dict.items() if v > 5}
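One caveat with this approach: the comprehension returns a plain dict, so the result is no longer a Counter. If Counter behaviour (e.g. most_common) is still needed afterwards, the result can be wrapped back up - a sketch:

```python
from collections import Counter

counter_dict = Counter("asbdasdbasdbadaasasdasadsa")
# The comprehension yields a plain dict; passing it back to Counter keeps
# methods like .most_common() available on the filtered result
counter_dict = Counter({k: v for k, v in counter_dict.items() if v > 5})
print(counter_dict.most_common(1))  # [('a', 10)]
```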

Upvotes: 4

Views: 6916

Answers (2)

Anshul Goyal

Reputation: 76857

The major issue with the current code is the call to .items, which (in Python 2) creates a list of all the items.

One optimization is to use Counter.iteritems instead of .items, to avoid the cost of building that list and then iterating over it again:

>>> from collections import Counter
>>> cnt = Counter("asbdasdbasdbadaasasdasadsa")
>>> {k:v for k,v in cnt.iteritems() if v > 5}
{'a': 10, 's': 7, 'd': 6}

Another optimization is to skip the .items call entirely and iterate over the keys, looking up each value by its key:

>>> from collections import Counter
>>> cnt = Counter("asbdasdbasdbadaasasdasadsa")
>>> {k:cnt[k] for k in cnt if cnt[k] > 5}
{'a': 10, 's': 7, 'd': 6}
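(For Python 3 readers: iteritems no longer exists there, and .items() already returns a lazy view rather than a list, so the plain comprehension from the question is fine - a sketch:)

```python
from collections import Counter

cnt = Counter("asbdasdbasdbadaasasdasadsa")
# In Python 3, .items() returns a view, not a list, so no extra copy is made
filtered = {k: v for k, v in cnt.items() if v > 5}
print(filtered)  # {'a': 10, 's': 7, 'd': 6}
```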

If we measure the difference with %timeit in IPython, using a sample Counter and the condition you mentioned, iteritems wins hands down:

In [1]: import random

In [2]: from collections import Counter

In [3]: MILLION = 10**6

In [4]: cnt = Counter(random.randint(0, MILLION) for _ in xrange(MILLION))

In [5]: %timeit {k:v for k, v in cnt.iteritems() if v < 5}
10 loops, best of 3: 140 ms per loop

In [6]: %timeit {k:v for k, v in cnt.items() if v < 5}
1 loops, best of 3: 290 ms per loop

In [7]: %timeit {k:cnt[k] for k in cnt if cnt[k] < 5}
1 loops, best of 3: 272 ms per loop

With change of conditions:

In [8]: %timeit {k:v for k, v in cnt.iteritems() if v > 5}
10 loops, best of 3: 87 ms per loop

In [9]: %timeit {k:v for k, v in cnt.items() if v > 5}
1 loops, best of 3: 186 ms per loop

In [10]: %timeit {k:cnt[k] for k in cnt if cnt[k] > 5}
10 loops, best of 3: 153 ms per loop

Upvotes: 4

jsbueno

Reputation: 110218

You are probably better off not recreating the whole dictionary every time:

# collect the keys first: deleting while iterating over the view
# would raise a RuntimeError
to_remove = set()
for key, value in counter_dict.viewitems():
    if value <= 5:
        to_remove.add(key)

for key in to_remove:
    del counter_dict[key]

Unfolding a "for" statement into more lines does not necessarily mean worse performance. There may not be much of a performance gain in this case, but memory consumption, at least, should go way down.

Another option is to make your "counter_dict" a smarter object that knows not to yield a value when its count is <= 5 - that would make this step "lazy".

Something along these lines (though the proper way is to build it on the ABCs in collections - e.g. collections.MutableMapping):

class MyDict(dict):
    def __init__(self, *args, **kw):
        self.threshold = None
        super(MyDict, self).__init__(*args, **kw)
    def __getitem__(self, key):
        value = super(MyDict, self).__getitem__(key)
        if self.threshold is None or value > self.threshold:
            return value
        raise KeyError(key)
    # the same for __contains__ and other interesting methods

You then set the "threshold" attribute on the dict when it should start filtering. This is arguably overkill - the checking still has to happen, just spread over time - but if you are consuming the objects in an async or multithreaded workload, it can happen in parallel; and if different parts of the code need different thresholds, it can be a nice thing to have.
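A minimal self-contained sketch of that idea in Python 3, built on collections.abc.MutableMapping as suggested above (the class and attribute names here are illustrative, not from the original answer):

```python
from collections import Counter
from collections.abc import MutableMapping

class ThresholdDict(MutableMapping):
    """A mapping that hides entries whose value is <= threshold."""

    def __init__(self, data=None, threshold=None):
        self._data = dict(data or {})
        self.threshold = threshold

    def _visible(self, key):
        # With no threshold set, every entry is visible
        return self.threshold is None or self._data[key] > self.threshold

    def __getitem__(self, key):
        if key in self._data and self._visible(key):
            return self._data[key]
        raise KeyError(key)

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        # Only yield keys whose counts pass the threshold
        return (k for k in self._data if self._visible(k))

    def __len__(self):
        return sum(1 for _ in self)

d = ThresholdDict(Counter("asbdasdbasdbadaasasdasadsa"), threshold=5)
print(dict(d))  # {'a': 10, 's': 7, 'd': 6}
```

Setting d.threshold back to None makes the hidden entries reappear, so the filtering really is lazy rather than destructive.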

Upvotes: 0
