Reputation: 35
How to find all of the most common elements in a Python list (ordered alphabetically in case of a tie)? Suppose you have a list l = ['b', 'b', 'a', 'a', 'c']; the output should be a 2, b 2. I need an efficient solution to this in terms of time complexity. Thank you.
Upvotes: 0
Views: 765
Reputation: 89
Using your code and Counter we get the following:
from collections import Counter
def most_frequent(items):
    occurrence_count = Counter(items)
    return occurrence_count.most_common()

l = ['b', 'b', 'b', 'a', 'a', 'c']
print(sorted(most_frequent(l)))
We get the output:
[('a', 2), ('b', 3), ('c', 1)]
With sorted we sort the tuples lexicographically, i.e. alphabetically by their first element. You can double-check that it's sorting alphabetically with the following code:
tuplesInList = most_frequent(l)

# to sort tuples within a list we use an anonymous function (lambda)
def sort_tuple_vals(my_tup):
    my_tup.sort(key=lambda x: x[0])
    return my_tup
This sorts the tuples by the first element in the tuple
print(sort_tuple_vals(tuplesInList))
gets the output
[('a', 2), ('b', 3), ('c', 1)]
If you want to sort by occurrence first and alphabetically only in case of a tie, the following code should work. We first sort the tuples by the number of occurrences with lambda x: x[1]:
tuplesInList = most_frequent(l)

# sort the tuples by their second element (the count) using a lambda
def sortTupsbyOccurance(my_tup):
    my_tup.sort(key=lambda x: x[1])
    return my_tup

result = sortTupsbyOccurance(tuplesInList)
print(result)
From a new initial list l = ['b', 'b', 'b', 'a', 'a', 'a', 'c'] we get the output:
[('c', 1), ('b', 3), ('a', 3)]
Using this I believe we can solve the alphabetical-sorting issue: if the second value of the tuple at position n equals the second value of the tuple at position n+1, we note the tie and compare their first values. Because we start at position 0 we are sure not to skip any potential matches.
# compare each adjacent pair, so the range runs to len(result) - 1
for n in range(len(result) - 1):
    if result[n][1] == result[n + 1][1]:
        print("there is an equality")
        var1 = str(result[n][0])
        var2 = str(result[n + 1][0])
        if var1 > var2:
            print(var1, var2)
            print("this is true")
            # if tuple n's first value is alphabetically greater than tuple n+1's, swap them
            result[n], result[n + 1] = result[n + 1], result[n]
print(result)
From the initial list l with three "a"s and three "b"s we get the output:
[('c', 1), ('a', 3), ('b', 3)]
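The adjacent-swap pass above only fixes neighbouring ties. A simpler and more robust alternative (a sketch, not part of the original answer) is a single sort with a composite key: count first, then the character to break ties alphabetically:

```python
from collections import Counter

l = ['b', 'b', 'b', 'a', 'a', 'a', 'c']

# sort ascending by count, and alphabetically within equal counts
counts = Counter(l).most_common()
counts.sort(key=lambda pair: (pair[1], pair[0]))
print(counts)  # [('c', 1), ('a', 3), ('b', 3)]
```

Because Python compares tuples element by element, the second key component only matters when the counts are equal, which is exactly the tie-breaking rule the question asks for.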
Upvotes: -1
Reputation: 13097
I wanted to see if I could make @Riccardo-Bucco's answer any faster (I could not), but I will show you an alternative that I thought might be faster and turns out to be basically the same speed.
To be clear, I feel the top answer is @Riccardo-Bucco's, as it is easier to follow and just as fast. Use it.
I was hoping that not having to scan the counter twice would more than make up for resetting the largest_with_ties list, but it did not.
def jonsg(data_in):
    largest_with_ties = [(None, 0)]
    for item in collections.Counter(data_in).items():
        diff = item[1] - largest_with_ties[0][1]
        if diff < 0:
            continue
        if diff > 0:
            largest_with_ties.clear()
        largest_with_ties.append(item)
    return sorted(largest_with_ties)
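On the question's original list, this single pass keeps every element tied for the highest count (shown here self-contained so it can be run directly):

```python
import collections

def jonsg(data_in):
    # track all items tied for the current maximum count in one pass
    largest_with_ties = [(None, 0)]
    for item in collections.Counter(data_in).items():
        diff = item[1] - largest_with_ties[0][1]
        if diff < 0:
            continue                   # strictly less common: ignore
        if diff > 0:
            largest_with_ties.clear()  # new maximum found: drop the old ties
        largest_with_ties.append(item)
    return sorted(largest_with_ties)

print(jonsg(['b', 'b', 'a', 'a', 'c']))  # [('a', 2), ('b', 2)]
```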
Testing the timings, I will use the words from "The Complete Works of William Shakespeare" from Project Gutenberg. You can get that here (5.5 MB): https://www.gutenberg.org/files/100/100-0.txt
Note, I have slightly altered Riccardo Bucco's answer to return tuples, not that it made a performance difference.
import timeit
setup = """
import collections
#data_in = ['b', 'b', 'a', 'a', 'c']
with open("shakespeare.txt", "r", encoding="utf-8") as file_in:
data_in = [word.strip().lower() for line in file_in for word in line.split()]
def riccardo_bucco(data_in):
counts = collections.Counter(data_in) # O(n)
largest = max(counts.values()) # O(n)
largest_with_ties = [item for item in counts.items() if item[1] == largest] # O(n)
return sorted(largest_with_ties)
def jonsg(data_in):
largest_with_ties = [(None, 0)]
for item in collections.Counter(data_in).items():
diff = item[1] - largest_with_ties[0][1]
if diff < 0:
continue
if diff > 0:
largest_with_ties.clear()
largest_with_ties.append(item)
return sorted(largest_with_ties)
"""
Now we can run:
print(f"riccardo_bucco: {timeit.timeit('riccardo_bucco(data_in)', setup=setup, number=100)}")
print(f"jonsg : {timeit.timeit('jonsg(data_in)', setup=setup, number=100)}")
giving results like:
riccardo_bucco: 10.59
jonsg : 10.55
Suggesting to me that they perform equally well (or poorly). Feel free to extend this with other attempts.
FYI: The actual most common is: ('the', 30087)
If one wants to test with the individual characters instead, data_in can be set via:
data_in = [char.lower() for char in file_in.read() if char.strip()]
In that case the most common is [('e', 482212)]
But doing so does not fundamentally alter the relative performance of these solutions.
Upvotes: 1
Reputation: 15364
Here is a possible solution (as discussed in the comments):
from collections import Counter
lst = # your list of characters
counts = Counter(lst) # O(n)
largest = max(counts.values()) # O(n)
largest_with_ties = [k for k, v in counts.items() if v == largest] # O(n)
result = sorted(largest_with_ties)
Now, what's the complexity of sorted(largest_with_ties)? One could say that it's O(n log n) (because there could be n/2 ties). However, the number of characters in largest_with_ties cannot grow as large as the number of elements in lst, because the set of possible characters is much smaller than, say, the set of possible ints. In other words, lst could potentially contain 10^20 numbers (just an example), but largest_with_ties can only contain distinct characters, and the number of characters that can be represented (for example) with UTF-8 is limited to roughly 10^6. Therefore, technically the complexity of this last operation is O(1). In general, we could say that it's O(n log n), but with an upper limit of O(10^6 log 10^6).
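Putting the snippet together on the question's example list (a sketch: the placeholder lst is filled in with the list from the question, and the keys-with-counts output shape follows the question's "a 2, b 2" request):

```python
from collections import Counter

lst = ['b', 'b', 'a', 'a', 'c']  # the question's example list

counts = Counter(lst)                                               # O(n)
largest = max(counts.values())                                      # O(n)
largest_with_ties = [k for k, v in counts.items() if v == largest]  # O(n)
result = sorted(largest_with_ties)  # ties ordered alphabetically

print(largest, result)  # 2 ['a', 'b']
```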
Upvotes: 4