prin

Reputation: 35

How to find all of the most common elements in a python list (order alphabetically in case of tie)?

Suppose you have the list l = ['b', 'b', 'a', 'a', 'c']. The output should be: a 2, b 2. I need an efficient solution in terms of time complexity. Thank you.

Upvotes: 0

Views: 765

Answers (3)

JelloSquirrel

Reputation: 89

Using your list and Counter we get the following:

from collections import Counter

def most_frequent(items):
    # Count occurrences of each element; most_common() returns
    # (element, count) pairs sorted by count, descending
    occurrence_count = Counter(items)
    return occurrence_count.most_common()

l = ['b', 'b', 'b', 'a', 'a', 'c']


print(sorted(most_frequent(l)))

We get the output:

[('a', 2), ('b', 3), ('c', 1)]

With sorted we sort the tuples alphabetically, because tuples compare lexicographically and the character is each tuple's first element.

You can double check that it is sorting alphabetically with the following code:

tuplesInList = most_frequent(l)

# To sort tuples within a list we use an anonymous function (lambda)
def sort_tuple_vals(tups):
    # Sort in place by each tuple's first element (the character)
    tups.sort(key=lambda x: x[0])
    return tups

This sorts the tuples by their first element.

print(sort_tuple_vals(tuplesInList))

which gives the output:

[('a', 2), ('b', 3), ('c', 1)]

If you want to sort by occurrence rather than alphabetically, and then alphabetically only in case of a tie, the following code should work. We first sort the tuples by the number of occurrences with lambda x: x[1].

tuplesInList = most_frequent(l)

# Sort the tuples by their second element (the count) using a lambda
def sort_tuples_by_occurrence(tups):
    tups.sort(key=lambda x: x[1])
    return tups

tuples = sort_tuples_by_occurrence(tuplesInList)
print(tuples)

From an initial list l = ['b', 'b', 'b', 'a', 'a', 'a', 'c'] we get the output:

[('c', 1), ('b', 3), ('a', 3)]

Using this, I believe we solve the alphabetical-sorting issue. If the second value of the tuple at position n equals the second value of the tuple at position n+1, we have a tie, and we move on to the inner if statement with those values. Because we start at position 0, we are sure not to skip any potential matches.

# Iterate over adjacent pairs; the range covers len(tuples) - 1 comparisons
for n in range(len(tuples) - 1):
    if tuples[n][1] == tuples[n + 1][1]:
        print("there is an equality")
        var1 = str(tuples[n][0])
        var2 = str(tuples[n + 1][0])

        if var1 > var2:
            print(var1, var2)
            print("this is true")
            # If tuple n's first value is alphabetically greater than
            # tuple n+1's first value, swap them
            tuples[n], tuples[n + 1] = tuples[n + 1], tuples[n]
            print(tuples)

From the initial list l with three "a"s and three "b"s we get the output:

[('c', 1), ('a', 3), ('b', 3)]
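As an aside (a minimal sketch of mine, not part of the answer above): the same occurrence-then-alphabetical ordering can be produced in a single step with a compound sort key.

from collections import Counter

l = ['b', 'b', 'b', 'a', 'a', 'a', 'c']

# Sort by count first, then alphabetically to break ties
print(sorted(Counter(l).items(), key=lambda x: (x[1], x[0])))
# [('c', 1), ('a', 3), ('b', 3)]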

Upvotes: -1

JonSG

Reputation: 13097

I wanted to see if I could make @riccardo-bucco's answer any faster (I could not), but I will show you an alternative that I thought might be faster and turned out to be basically the same speed.

To be clear, I feel the top answer is @riccardo-bucco's, as it is easier to follow and just as fast. Use it.

I was hoping that not having to scan the counter twice would more than make up for resetting the largest_with_ties list, but it did not.

import collections

def jonsg(data_in):
    # Track the largest count seen so far; ties accumulate in the list
    largest_with_ties = [(None, 0)]
    for item in collections.Counter(data_in).items():
        diff = item[1] - largest_with_ties[0][1]
        if diff < 0:
            # Smaller count than the current best: skip
            continue
        if diff > 0:
            # Strictly larger count: discard the previous ties
            largest_with_ties.clear()
        largest_with_ties.append(item)
    return sorted(largest_with_ties)

Testing the timings, I will use the words from "The Complete Works of William Shakespeare" from Project Gutenberg. You can get it here (5.5 MB): https://www.gutenberg.org/files/100/100-0.txt

Note, I have slightly altered Riccardo Bucco's answer to return tuples, though that made no performance difference.

import timeit

setup = """
import collections

#data_in = ['b', 'b', 'a', 'a', 'c']
with open("shakespeare.txt", "r", encoding="utf-8") as file_in:
    data_in = [word.strip().lower() for line in file_in for word in line.split()]

def riccardo_bucco(data_in):
    counts = collections.Counter(data_in) # O(n)
    largest = max(counts.values()) # O(n)
    largest_with_ties = [item for item in counts.items() if item[1] == largest] # O(n)
    return sorted(largest_with_ties)

def jonsg(data_in):
    largest_with_ties = [(None, 0)]
    for item in collections.Counter(data_in).items():
        diff = item[1] - largest_with_ties[0][1]
        if diff < 0:
            continue
        if diff > 0:
            largest_with_ties.clear()
        largest_with_ties.append(item)
    return sorted(largest_with_ties)
"""

Now we can run:

print(f"riccardo_bucco: {timeit.timeit('riccardo_bucco(data_in)', setup=setup, number=100)}")
print(f"jonsg         : {timeit.timeit('jonsg(data_in)', setup=setup, number=100)}")

giving results like:

riccardo_bucco: 10.59
jonsg         : 10.55

This suggests to me that they perform equally well (or poorly). Feel free to extend this with other attempts.
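One such attempt (my own sketch, not timed here): Counter.most_common already returns items sorted by count descending, so the leading run of ties can be taken with itertools.takewhile. Note that most_common performs its own O(n log n) sort, so this is unlikely to beat the linear scans above.

import collections
import itertools

def takewhile_ties(data_in):
    # (element, count) pairs sorted by count, descending
    ranked = collections.Counter(data_in).most_common()
    largest = ranked[0][1]
    # Keep the leading items whose count ties the largest, then alphabetize
    return sorted(itertools.takewhile(lambda item: item[1] == largest, ranked))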

FYI: The actual most common is: ('the', 30087).

If one wants to test with individual characters instead, data_in can be set via:

data_in = [char.lower() for char in file_in.read() if char.strip()]

In that case the most common is [('e', 482212)].

But doing so does not fundamentally alter the relative performance of these solutions.

Upvotes: 1

Riccardo Bucco

Reputation: 15364

Here is a possible solution (as discussed in the comments):

from collections import Counter

lst = ['b', 'b', 'a', 'a', 'c'] # your list of characters
counts = Counter(lst) # O(n)
largest = max(counts.values()) # O(n)
largest_with_ties = [k for k, v in counts.items() if v == largest] # O(n)
result = sorted(largest_with_ties)

Now, what's the complexity of sorted(largest_with_ties)? One could say it's O(n log n), because there could be up to n/2 ties. However, the number of elements in largest_with_ties cannot grow as large as the number of elements in lst, because it contains only distinct characters, and the character set is finite (unlike, say, the integers). lst could potentially contain 10^20 items, but largest_with_ties can only contain distinct characters, and the number of characters that can be represented with (for example) UTF-8 is limited to roughly 10^6. Therefore, technically the complexity of this last operation is O(1). In general we could say it's O(n log n), but with an upper bound of O(10^6 log 10^6).
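As a quick check, running the snippet above on the question's sample list yields the expected tied pair:

from collections import Counter

lst = ['b', 'b', 'a', 'a', 'c']
counts = Counter(lst)
largest = max(counts.values())
largest_with_ties = [k for k, v in counts.items() if v == largest]
print(sorted(largest_with_ties)) # ['a', 'b'] (both appear twice)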

Upvotes: 4
