user13846418

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):


        user_id   revenue  
 ----- --------- --------- 
    0       234       100  
    1      2873       200  
    2       827       489  
    3        12       237  
    4      8942     28934  
  ...       ...       ...  
   96       498    892384  
   97      2345        92  
   98       239      2803  
   99      4985     98332  
  100       947      4588  

which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).

The closest approach I can think of is counting the total number of users, working out 20% of that, sorting the dataframe with sort_values(), and then using head() or nlargest(), but I'd like to know if there is a simpler and more elegant way.
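For reference, here is a minimal sketch of the approach described above. It uses the first five rows of the sample as stand-in data, and the rounding rule for "20% of the users" is an assumption, not something the question specifies:

```python
import pandas as pd

# Stand-in data: the first five rows of the sample table
df = pd.DataFrame({
    'user_id': [234, 2873, 827, 12, 8942],
    'revenue': [100, 200, 489, 237, 28934],
})

# Number of users that make up the top 20% (assumed rule: round, at least 1)
n_top = max(1, round(0.2 * len(df)))

# nlargest sorts and truncates in one call
top_users = df.nlargest(n_top, 'revenue')
print(top_users)
```

This selects the top 20% of users by count, which is not necessarily the same set as the users who together generate 20% of total revenue.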

Can anybody propose a way for this? Thank you!

Upvotes: 0

Views: 468

Answers (4)

nimbous

Reputation: 1537

I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, column name of the revenue and the n_percent you are looking for:

import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    # Sort a copy (no inplace) so the caller's dataframe is left untouched
    df = df.sort_values(by=col, ascending=False)
    # Cumulative share of total revenue, in percent
    csp = 100 * df[col].cumsum() / df[col].sum()
    # First row at which the cumulative share reaches n_percent
    index_nearest = csp[csp >= n_percent].index[0]
    threshold_revenue = df.loc[index_nearest, col]
    # Keep every row at or above that revenue, so ties are included
    return df[df[col] >= threshold_revenue]

n_percent_revenue_generating_users(df, 'revenue', 20)

Upvotes: 0

ipj

Reputation: 3598

Suppose you have a dataframe df:

user_id revenue
234     21  
2873    20  
827     23  
12      23  
8942    28  
498     22  
2345    20  
239     24  
4985    21  
947     25

I've flattened the revenue distribution to show the idea. Now calculate step by step:

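import pandas as pd

# Read the table above from the clipboard
df = pd.read_clipboard()
# Highest revenue first
df = df.sort_values(by = 'revenue', ascending = False)
# Running total of revenue and its share of the overall total
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df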

result:

   user_id  revenue  revenue_cum  %revenue_cum
4     8942       28           28      0.123348
9      947       25           53      0.233480
7      239       24           77      0.339207
2      827       23          100      0.440529
3       12       23          123      0.541850
5      498       22          145      0.638767
0      234       21          166      0.731278
8     4985       21          187      0.823789
1     2873       20          207      0.911894
6     2345       20          227      1.000000

The top 2 users alone generate 23.3% of total revenue.
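To go from the cumulative column to an actual selection, one option is to keep every row up to and including the one that crosses the target share. This is only a sketch: the `target` value and the choice to include the crossing row are assumptions, and the dataframe is rebuilt inline rather than read from the clipboard:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [234, 2873, 827, 12, 8942, 498, 2345, 239, 4985, 947],
    'revenue': [21, 20, 23, 23, 28, 22, 20, 24, 21, 25],
})
df = df.sort_values(by='revenue', ascending=False)

# Cumulative share of total revenue, as a fraction between 0 and 1
share = df['revenue'].cumsum() / df['revenue'].sum()

target = 0.20  # assumed target share

# A row is kept if the share accumulated *before* it is still below the
# target, so the row that crosses the target is included
top = df[share.shift(fill_value=0) < target]
print(top)
```

With this data the selection is the two users 8942 and 947, matching the 23.3% figure above.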

Upvotes: 1

naccode

Reputation: 520

I usually find it useful to use sort_values to see the cumulative effect of every row and then keep the rows up to some threshold:

# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)

# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()

# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]

The original df will be nicely sorted with a clear indication of the top contributing rows and the created 'top_percent' df will contain the rows that need to be analyzed in particular.
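One caveat worth checking on skewed data: because the filter uses `<=`, a single row that already exceeds the threshold leaves the result empty. A small sketch using the first five rows of the question's sample (stand-in data, same column names as above) shows this:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [234, 2873, 827, 12, 8942],
    'revenue': [100, 200, 489, 237, 28934],
})

df = df.sort_values(by='revenue', ascending=False)
df['cumulative_percentage'] = 100 * df['revenue'].cumsum() / df['revenue'].sum()

min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]

# The top user alone accounts for well over 30% of revenue,
# so no row satisfies the filter
print(top_percent.empty)
```

If the row that crosses the threshold should be included, the comparison can be made against the previous row's cumulative percentage instead.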

Upvotes: 0

Celius Stingher

Reputation: 18377

This seems to be a case for df.quantile. According to the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you want.

A case example from your dataset:

import pandas as pd

df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
                   'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')

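This prints the 0.8 and 1.0 quantiles of each column. Note that quantile is computed for each column independently, so a row of the result is not necessarily a row of the original dataframe (here user 2873 actually has revenue 200, not 489):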

     user_id  revenue
0.8     2873      489
1.0     8942    28934
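To select actual rows rather than per-column quantile values, one possible approach is to compute the quantile of the revenue column only and use it as a cutoff. This is a sketch under the assumption that "top 20%" means the users at or above the 0.8 quantile of revenue, with pandas' default linear interpolation:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})

# 0.8 quantile of the revenue column only (a single number)
cutoff = df['revenue'].quantile(0.8)

# Boolean mask keeps whole rows, so user_id and revenue stay paired
top_users = df[df['revenue'] >= cutoff]
print(top_users)
```

On this five-row sample only user 8942 is at or above the cutoff.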

Upvotes: 0
