Reputation: 3
Let's say I take the iris data set, for example. I sample the data randomly to get a subset. Next, I want to count the instances in each class, so I group the data by Species and use the .count() function. So far so good.
Here is the code to do that:
import numpy as np
import pandas as pd
iris_df = pd.read_csv('./data/iris.csv') # this file has 150 rows
# draws 60 row positions with replacement from 1..149 (row 0 is never picked)
subset_df = iris_df.iloc[np.random.randint(1, 150, 60), ]
subset_df.groupby('Species', as_index = False).count()
## Output
      Species  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
0      setosa            19           19            19           19
1   virginica            20           20            20           20
2  versicolor            21           21            21           21
Now, my question: is there a way to get the group label of the class with the most samples? In the output above, versicolor has the most samples, so I want to get that group label.
I tried taking the max of the above result, but that takes the maximum of each column separately, so the Species column is compared character by character and returns virginica, which is definitely not what I want, although the output makes sense.
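To illustrate why the max returns virginica, here is a minimal sketch. The `counts` frame is a hypothetical stand-in that copies the numbers from the output above (with only one count column), not the real iris file.

```python
import pandas as pd

# Hypothetical counts mirroring the output above (not read from iris.csv).
counts = pd.DataFrame({
    "Species": ["setosa", "virginica", "versicolor"],
    "Sepal.Length": [19, 20, 21],
})

# DataFrame.max() reduces each column independently, so the pairing
# between a label and its count is lost.
col_max = counts.max()
print(col_max["Species"])       # largest string in the Species column
print(col_max["Sepal.Length"])  # largest count, taken from a different row

# Using idxmax() on the count column keeps the label and count together.
top_label = counts.loc[counts["Sepal.Length"].idxmax(), "Species"]
print(top_label)
```

The first print shows the string maximum, while the last line looks up the label on the row that actually holds the largest count.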
The one other way I can think of to get the group name is to use .groups on the grouped data frame, by running the following code:
species_dict = subset_df.groupby('Species', as_index = False).groups
max_ind = np.argmax([len(species_dict[k]) for k in species_dict.keys()])
print(list(species_dict.keys())[max_ind])
Is there a better, more efficient way, using some Pandas functionality that I've missed? Please let me know.
Upvotes: 0
Views: 379
Reputation: 577
If I'm understanding your question correctly (you want to return the most frequent label in your subset), I think you can do it without the groupby function, just using pandas value_counts().
This creates a pandas Series with the labels as the index and the counts as the data. You can set it to sort the values from highest to lowest and then select the first entry of the index.
# count values in Species column sorting most common to least common
subset_df.Species.value_counts(sort=True, ascending=False).index[0]
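As a possible variation on the same idea, `idxmax()` can pick the label directly without relying on sort order. The sketch below builds a hypothetical `subset_df` from the class counts in the question rather than reading the real CSV, so treat the data as an assumption.

```python
import pandas as pd

# Hypothetical stand-in for subset_df, using the class counts from the question.
subset_df = pd.DataFrame({
    "Species": ["setosa"] * 19 + ["virginica"] * 20 + ["versicolor"] * 21
})

# value_counts() returns a Series indexed by label;
# idxmax() gives the label with the highest count.
top_via_value_counts = subset_df["Species"].value_counts().idxmax()

# The same idea with groupby: size() counts the rows in each group.
group_sizes = subset_df.groupby("Species").size()
top_via_groupby = group_sizes.idxmax()

# mode() returns every label tied for most frequent, which matters
# if two classes can have the same count.
tied_labels = subset_df["Species"].mode().tolist()

print(top_via_value_counts, top_via_groupby, tied_labels)
```

If ties are possible in your sample, `mode()` is probably the safer choice, since `idxmax()` and `.index[0]` each return only one label.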
Upvotes: 0