Reputation: 12687
I'd like to implement a Counter which drops the least frequent element when the counter's size going beyond some threshold. For that I need to remove the least frequent element.
What is the fastest way to do that in Python?
I know counter.most_common()[-1]
, but it creates a whole list and seems slow when done extensively? Is there a better command (or maybe a different data structure)?
Upvotes: 3
Views: 1873
Reputation: 387557
If you only want to get the least common value, then the most efficient way to handle this is to simply get the minimum value from the counter (dictionary).
Since you can only say whether a value is the lowest, you actually need to look at all items, so a time complexity of O(n) is really the lowest we can get. However, we do not need to have a linear space complexity, as we only need to remember the lowest value, and not all of them. So a solution that works like most_common()
in reverse is too much for us.
In this case, we can simply use min()
with a custom key function here:
>>> c = Counter('foobarbazbar')
>>> c
Counter({'a': 3, 'b': 3, 'o': 2, 'r': 2, 'f': 1, 'z': 1})
>>> k = min(c, key=lambda x: c[x])
>>> del c[k]
>>> c
Counter({'a': 3, 'b': 3, 'o': 2, 'r': 2, 'z': 1})
Of course, since dictionaries are unordered, you do not get any influence on which of the lowest values is removed that way in case there are multiple with the same lowest occurrence.
Upvotes: 0
Reputation: 103754
Since your stated goal is to remove items in the counter below a threshold, just reverse the counter (so the values becomes a list of keys with that value) and then remove the keys in the counter below the threshold.
Example:
>>> c=Counter("aaaabccadddefeghizkdxxx")
>>> c
Counter({'a': 5, 'd': 4, 'x': 3, 'c': 2, 'e': 2, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'h': 1, 'k': 1, 'z': 1})
counts={}
for k, v in c.items():
counts.setdefault(v, []).append(k)
tol=2
for k, v in counts.items():
if k<=tol:
c=c-Counter({}.fromkeys(v, k))
>>> c
Counter({'a': 5, 'd': 4, 'x': 3})
In this example, all counts less than or equal to 2 are removed.
Or, just recreate the counter with a comparison to your threshold value:
>>> c
Counter({'a': 5, 'd': 4, 'x': 3, 'c': 2, 'e': 2, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'h': 1, 'k': 1, 'z': 1})
>>> Counter({k:v for k,v in c.items() if v>tol})
Counter({'a': 5, 'd': 4, 'x': 3})
Upvotes: 0
Reputation: 23203
You may implement least_common
by borrowing implementation of most_common
and performing necessary changes.
Refer to collections
source in Py2.7:
def most_common(self, n=None):
'''List the n most common elements and their counts from the most
common to the least. If n is None, then list all element counts.
>>> Counter('abcdeabcdabcaba').most_common(3)
[('a', 5), ('b', 4), ('c', 3)]
'''
# Emulate Bag.sortedByCount from Smalltalk
if n is None:
return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))
To change it in order to retrieve least common we need just a few adjustments.
import collections
from operator import itemgetter as _itemgetter
import heapq as _heapq
class MyCounter(collections.Counter):
def least_common(self, n=None):
if n is None:
return sorted(self.iteritems(), key=_itemgetter(1), reverse=False) # was: reverse=True
return _heapq.nsmallest(n, self.iteritems(), key=_itemgetter(1)) # was _heapq.nlargest
Tests:
c = MyCounter("abbcccddddeeeee")
assert c.most_common() == c.least_common()[::-1]
assert c.most_common()[-1:] == c.least_common(1)
Upvotes: 1