justanewb
justanewb

Reputation: 133

Filter out dataframe based on values being within the 90th percentile

Suppose I have this dataframe

d = {'fruit': ['apple', 'apple', 'apple', 'orange'], 'names': ['bob', 'bob', 'bob', 'joe']}

df = pd.DataFrame(data=d)

Now I want to go through each column and filter out the low percentiles keeping only values that are contained in the 90th percentile.

Thus since apple and bob are each within their associated 90th percentiles I would have this dataframe

d = {'fruit': ['apple', 'apple', 'apple'], 'names': ['bob', 'bob', 'bob']}

df = pd.DataFrame(data=d)

How do I achieve this?

Upvotes: 1

Views: 209

Answers (1)

Inputvector
Inputvector

Reputation: 1093

Hope this helps:

  1. Calculate 90th percentile and keep them in a dictionary:
perc_dic = df.apply(lambda x : np.percentile(x.value_counts().tolist(), 90) , axis= 0).to_dict()

perc_dic 
{'fruit': 2.8, 'names': 2.8}

2.Calculate value counts of each value:

value_dic = df.apply(lambda x : x.value_counts() , axis= 0).to_dict()

value_dic
{'fruit': {'apple': 3.0, 'bob': nan, 'joe': nan, 'orange': 1.0},
 'names': {'apple': nan, 'bob': 3.0, 'joe': 1.0, 'orange': nan}}
  1. Compare value counts and percentile:
df_result =  df.apply(lambda row : row  if value_dic['fruit'][row['fruit']] >= perc_dic['fruit']  and
                                           value_dic['names'][row['names']] >= perc_dic['names']
                                          else np.nan , axis= 1)

df_result.dropna()
    fruit   names
0   apple   bob
1   apple   bob
2   apple   bob

Note: There are some nan values in value_dic but it won't hurt the performance.

Upvotes: 1

Related Questions