Efficient way to get groupby label for the group with max count

Question

Let's say I take the iris data set as an example. I sample the data randomly to get a subset. Next I want the number of instances in each class, so I group the data by Species and call .count(). So far so good.

Here is the code to do that:

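import numpy as np
import pandas as pd
iris_df = pd.read_csv('./data/iris.csv') # this file has 150 rows
# draw 60 row positions from 0..149 (with replacement); starting at 1 would never pick the first row
subset_df = iris_df.iloc[np.random.randint(0, 150, 60)]
# count the non-null values per column within each species
subset_df.groupby('Species', as_index = False).count()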

## Output
      Species  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
0      setosa            19           19            19           19
1   virginica            20           20            20           20
2  versicolor            21           21            21           21

Now this is my question: is there a way to get the label of the group with the most samples? In the output above, versicolor has the most samples, so I want to get that group label.

I tried taking the max of the result above, but max works column by column: it compares the Species labels as strings and returns virginica. That output makes sense, but it is not the answer I want.
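To illustrate why the max returns virginica, here is a minimal sketch. The `counts` frame is a hand-built stand-in for the counted output above (an assumption, not the real iris subset), reduced to one count column.

```python
import pandas as pd

# Hypothetical stand-in for the grouped counts shown above.
counts = pd.DataFrame({
    "Species": ["setosa", "virginica", "versicolor"],
    "Sepal.Length": [19, 20, 21],
})

# max() reduces each column independently, so the two values
# below can come from different rows.
col_max = counts.max()
print(col_max["Species"])       # alphabetically last label
print(col_max["Sepal.Length"])  # largest count, not tied to that label
```

The label and the count are each the maximum of their own column, so they are not guaranteed to belong to the same group.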

The other way I can think of to get the group name is to use .groups on the grouped data frame, like this:

species_dict = subset_df.groupby('Species', as_index = False).groups
max_ind = np.argmax([len(species_dict[k]) for k in species_dict.keys()])
print(list(species_dict.keys())[max_ind])

Is there a better, more efficient way, using some Pandas functionality that I've missed? Please let me know.

Sparrow0hawk · Accepted Answer

If I'm understanding your question correctly (you want to return the most frequent label in your subset), I think you can do it without groupby, just using pandas value_counts().

This creates a pandas Series with the labels as the index and the counts as the data. You can tell it to sort the values from highest to lowest (which is also the default behaviour) and then select the first index entry.

# count values in Species column sorting most common to least common
subset_df.Species.value_counts(sort=True, ascending=False).index[0]
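As a variation on the same idea, the sketch below uses idxmax(), which returns the index label of the largest value, so no sorting or positional lookup is needed. The `toy_df` frame is made-up data standing in for subset_df (an assumption for illustration only).

```python
import pandas as pd

# Hypothetical toy data standing in for subset_df.
toy_df = pd.DataFrame({
    "Species": ["setosa"] * 2 + ["virginica"] * 3 + ["versicolor"] * 4,
    "Sepal.Length": range(9),
})

# size() counts rows per group; idxmax() returns the label of the largest count.
top_by_groupby = toy_df.groupby("Species").size().idxmax()

# The same lookup without groupby.
top_by_value_counts = toy_df["Species"].value_counts().idxmax()

# mode() returns every label tied for the highest count.
tied = toy_df["Species"].mode().tolist()

print(top_by_groupby, top_by_value_counts, tied)
```

If two groups can share the highest count, idxmax() and .index[0] each return only one of them, so mode() may be the safer choice when ties matter.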
