raven
raven

Reputation: 527

Pythonic way to count the most frequent items with duplicates in Counter, separately by number of elements?

I'm interested in a neat (+ optimal) way to solve the following problem, as I'm curious how to execute this in a Pythonic way. While doing a separate task, I attempted to make a collections.Counter dictionary with arbitrary elements:

d = Counter({('A','C'): 4, ('B','D','E'): 3, ..., ('A','G','V','X','Z'): 1, ('L','Z'): 1})

Here, each element is a tuple, consisting of 2 or more characters. My goal is to create a list with the most frequent items with duplicates uniquely by the number of elements in a tuple. For example, the solution may look like the following:

d_sample = Counter({('A', 'C'): 4, ('B', 'D', 'E'): 3, ('A' ,'D'): 3, ('C', 'D', 'E'): 3, 
                    ('A', 'B', 'C', 'D', 'E'): 2, ('A', 'C', 'D', 'E'): 1, 
                    ('B', 'C', 'D', 'E'): 1, ('D', 'E'): 1})

result = [('A', 'C'), ('B', 'D', 'E'), ('C', 'D', 'E'), ('A', 'B', 'C', 'D', 'E'),
         ('A', 'C', 'D', 'E'), ('B', 'C', 'D', 'E')]

Here, ('A', 'C') is selected, as it is the tuple with two elements with the highest count. ('B', 'D', 'E') and ('C', 'D', 'E') are both selected as they are the tuples with three elements with the highest count. Same goes for other elements in the result list.

The method I thought of was using a for loop to extract several lists from d, separated by number of elements: for i in range(2, # maximum elements using another for loop). From there, I counted the item(s) with the highest frequency using another for loop and max().

# pseudo-example:

maxval = 2

for x in d:
   if len(x[0]) > maxval:
       maxval = len(x[0])

counter_list = [[] * n for n in range(maxval-2)]

for x in d:
   counter_list[len(x[0])-2].append(x)

## selecting max per list

As elaborated, this doesn't seem Pythonic and optimal at all. Any insights to deliver the task more optimally would be appreciated.

Upvotes: 2

Views: 336

Answers (1)

wim
wim

Reputation: 362557

This is O(n) and I don't think it will be possible to get asymptotically any better than that.

highest_count_for_length = {}
result = []
for tup, count in d.items():
    try:
        if highest_count_for_length[len(tup)] == count:
            result.append(tup)
    except KeyError:
        # we haven't seen this length yet
        highest_count_for_length[len(tup)] = count
        result.append(tup)

It does rely on the fact that the input is already ordered by value.

Upvotes: 2

Related Questions