Reputation: 27
Consider the following dataframe:
ID Column
0 500 2
1 500 2
2 500 2
3 500 2
4 500 2
5 500 4
How can we check whether the most common value of 'Column' appears more than X% of the time?
I've tried df.locate[df.groupby('ID')['Column'].count_values(normalize=True).max() > X], but I get an error.
Upvotes: 1
Views: 38
Reputation: 12808
I think what you had was close to a solution. It's not clear to me whether you want to calculate this over the whole column or per group, so here's a solution for both. You can change the variable at_least_this_proportion to set the minimum threshold:
import pandas as pd
from io import StringIO
text = """
ID Column
0 500 2
1 500 2
2 500 2
3 500 2
4 500 2
5 500 4
6 501 2
7 501 2
"""
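# raw string so that \s is passed to the regex engine unchanged; the unnamed first column becomes the index
df = pd.read_csv(StringIO(text), header=0, sep=r'\s+')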
# set minimum threshold
at_least_this_proportion = 0.5
Calculate per group:
# find the IDs in which some value makes up more than the threshold proportion of its group
value_counts_per_group = df.groupby('ID')['Column'].value_counts(normalize=True)
ids_that_meet_threshold = value_counts_per_group[value_counts_per_group > at_least_this_proportion].index.get_level_values(0)
# get all rows for which the id meets the threshold
df[df['ID'].isin(ids_that_meet_threshold)]
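Calculate over the whole column:
For the whole-column case, a minimal sketch could look like the following. It rebuilds the same sample data directly so it runs on its own, and the names proportions, most_common_value, most_common_share and exceeds_threshold are just illustrative choices of mine. It relies on value_counts sorting by frequency in descending order, which is its default behaviour.

```python
import pandas as pd

# same sample data as above, built directly so the snippet is self-contained
df = pd.DataFrame({
    'ID': [500, 500, 500, 500, 500, 500, 501, 501],
    'Column': [2, 2, 2, 2, 2, 4, 2, 2],
})

# set minimum threshold
at_least_this_proportion = 0.5

# share of each distinct value over the whole column, most common first
proportions = df['Column'].value_counts(normalize=True)

# first entry is the most common value and its share of all rows
most_common_value = proportions.index[0]
most_common_share = proportions.iloc[0]

# True if the most common value makes up more than the threshold
exceeds_threshold = most_common_share > at_least_this_proportion
```

As for the error in the question: as far as I can tell, df.locate and count_values are not pandas names. The corresponding pandas names are df.loc and value_counts.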
Upvotes: 1