Gere
Gere

Reputation: 12687

How to remove the least frequent element from a Counter in Python the fastest way?

I'd like to implement a Counter which drops the least frequent element when the counter's size going beyond some threshold. For that I need to remove the least frequent element.

What is the fastest way to do that in Python?

I know counter.most_common()[-1], but it creates a whole list and seems slow when done extensively? Is there a better command (or maybe a different data structure)?

Upvotes: 3

Views: 1873

Answers (3)

poke
poke

Reputation: 387557

If you only want to get the least common value, then the most efficient way to handle this is to simply get the minimum value from the counter (dictionary).

Since you can only say whether a value is the lowest, you actually need to look at all items, so a time complexity of O(n) is really the lowest we can get. However, we do not need to have a linear space complexity, as we only need to remember the lowest value, and not all of them. So a solution that works like most_common() in reverse is too much for us.

In this case, we can simply use min() with a custom key function here:

>>> c = Counter('foobarbazbar')
>>> c
Counter({'a': 3, 'b': 3, 'o': 2, 'r': 2, 'f': 1, 'z': 1})

>>> k = min(c, key=lambda x: c[x])
>>> del c[k]
>>> c
Counter({'a': 3, 'b': 3, 'o': 2, 'r': 2, 'z': 1})

Of course, since dictionaries are unordered, you do not get any influence on which of the lowest values is removed that way in case there are multiple with the same lowest occurrence.

Upvotes: 0

dawg
dawg

Reputation: 103754

Since your stated goal is to remove items in the counter below a threshold, just reverse the counter (so the values becomes a list of keys with that value) and then remove the keys in the counter below the threshold.

Example:

>>> c=Counter("aaaabccadddefeghizkdxxx")
>>> c
Counter({'a': 5, 'd': 4, 'x': 3, 'c': 2, 'e': 2, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'h': 1, 'k': 1, 'z': 1})

counts={}
for k, v in c.items():
    counts.setdefault(v, []).append(k)

tol=2   
for k, v in counts.items():
    if k<=tol:
        c=c-Counter({}.fromkeys(v, k))
>>> c
Counter({'a': 5, 'd': 4, 'x': 3})

In this example, all counts less than or equal to 2 are removed.

Or, just recreate the counter with a comparison to your threshold value:

>>> c
Counter({'a': 5, 'd': 4, 'x': 3, 'c': 2, 'e': 2, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'h': 1, 'k': 1, 'z': 1})
>>> Counter({k:v for k,v in c.items() if v>tol})
Counter({'a': 5, 'd': 4, 'x': 3})

Upvotes: 0

Łukasz Rogalski
Łukasz Rogalski

Reputation: 23203

You may implement least_common by borrowing implementation of most_common and performing necessary changes.

Refer to collections source in Py2.7:

def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least.  If n is None, then list all element counts.

    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]

    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))

To change it in order to retrieve least common we need just a few adjustments.

import collections
from operator import itemgetter as _itemgetter
import heapq as _heapq


class MyCounter(collections.Counter):
    def least_common(self, n=None):
        if n is None:
            return sorted(self.iteritems(), key=_itemgetter(1), reverse=False)  # was: reverse=True
        return _heapq.nsmallest(n, self.iteritems(), key=_itemgetter(1))  # was _heapq.nlargest

Tests:

c = MyCounter("abbcccddddeeeee")
assert c.most_common() == c.least_common()[::-1]
assert c.most_common()[-1:] == c.least_common(1)

Upvotes: 1

Related Questions