Reputation: 133
Suppose I have this dataframe:
d = {'fruit': ['apple', 'apple', 'apple', 'orange'], 'names': ['bob', 'bob', 'bob', 'joe']}
df = pd.DataFrame(data=d)
Now I want to go through each column and filter out the rarer values, keeping only values whose counts are at or above the 90th percentile of that column's value counts.
Thus, since apple and bob each fall within their column's 90th percentile, I would end up with this dataframe:
d = {'fruit': ['apple', 'apple', 'apple'], 'names': ['bob', 'bob', 'bob']}
df = pd.DataFrame(data=d)
How do I achieve this?
Upvotes: 1
Views: 209
Reputation: 1093
Hope this helps:
1. Calculate the 90th percentile of each column's value counts:
import numpy as np
import pandas as pd

perc_dic = df.apply(lambda x: np.percentile(x.value_counts(), 90), axis=0).to_dict()
perc_dic
{'fruit': 2.8, 'names': 2.8}
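The 2.8 comes from NumPy's default linear interpolation between the two counts in each column (3 and 1); a quick sanity check:

```python
import numpy as np

# fruit counts: apple=3, orange=1 -> sorted [1, 3]
# 90th percentile interpolates: 1 + 0.9 * (3 - 1) = 2.8
p = np.percentile([3, 1], 90)
print(p)  # 2.8
```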
2. Calculate the count of each value in each column:
value_dic = df.apply(lambda x : x.value_counts() , axis= 0).to_dict()
value_dic
{'fruit': {'apple': 3.0, 'bob': nan, 'joe': nan, 'orange': 1.0},
'names': {'apple': nan, 'bob': 3.0, 'joe': 1.0, 'orange': nan}}
3. Keep only the rows whose values meet both thresholds:
df_result = df.apply(lambda row: row if value_dic['fruit'][row['fruit']] >= perc_dic['fruit'] and
                                        value_dic['names'][row['names']] >= perc_dic['names']
                     else np.nan, axis=1)
df_result.dropna()
fruit names
0 apple bob
1 apple bob
2 apple bob
Note: There are some NaN values in value_dic, but they don't affect the result.
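The three steps above can also be collapsed into one loop that builds a boolean mask without hard-coding the column names. This is a minimal sketch of the same idea (the `mask` variable name is my own, not from the answer):

```python
import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'apple', 'apple', 'orange'],
     'names': ['bob', 'bob', 'bob', 'joe']}
df = pd.DataFrame(data=d)

# Keep a row only if, in every column, the value's count reaches
# that column's 90th-percentile count.
mask = pd.Series(True, index=df.index)
for col in df.columns:
    counts = df[col].map(df[col].value_counts())  # per-row count of that row's value
    mask &= counts >= np.percentile(df[col].value_counts(), 90)

result = df[mask].reset_index(drop=True)
print(result)
```

Since it never converts rows to NaN, there is no dropna() step, and the dtypes of the original columns are preserved.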
Upvotes: 1