Ronith

Reputation: 199

Calculating the percentage of a category in Pandas

I have a dataframe train and I have filtered a certain number of rows from the train dataframe to form the promoted dataframe:

print(train.department.value_counts(),'\n')
promoted=train[train.is_promoted==1]
print(promoted.department.value_counts())

The output of the above code is:

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

Sales & Marketing    1213
Operations           1023
Technology            768
Procurement           688
Analytics             512
Finance               206
HR                    136
R&D                    69
Legal                  53
Name: department, dtype: int64

I want to display what percentage of each department's rows in train appears in the promoted dataframe, i.e. instead of the counts 1213, 1023, 768, 688, etc., I should get a percentage such as 1213/16840*100 = 7.2, etc. Please note that I don't want a normalized value.

Upvotes: 1

Views: 13716

Answers (4)

Sarfraaz Ahmed

Reputation: 617

Found a better answer at: https://stackoverflow.com/a/50558594/4106458

It suggests passing the named parameter normalize=True to the value_counts() method.

For your scenario, the code would be:

promoted.department.value_counts(normalize=True) * 100
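
To make clear what normalize=True computes here, the following is a minimal sketch that rebuilds a stand-in for promoted.department from the counts posted in the question (the real dataframe is not available, so the rows are reconstructed by hand). It shows that the result is each department's share of all promoted rows, which is not the same quantity as the per-department ratio 1213/16840*100 described in the question.

```python
import pandas as pd

# Promoted counts copied by hand from the question's output.
promoted_counts = {
    "Sales & Marketing": 1213, "Operations": 1023, "Technology": 768,
    "Procurement": 688, "Analytics": 512, "Finance": 206,
    "HR": 136, "R&D": 69, "Legal": 53,
}

# Stand-in for promoted.department: one entry per promoted row.
department = pd.Series(
    [name for name, n in promoted_counts.items() for _ in range(n)],
    name="department",
)

# Share of each department among the promoted rows, in percent.
share_of_promoted = department.value_counts(normalize=True) * 100
print(share_of_promoted.round(2))
```

The values add up to 100 across departments, and Sales & Marketing comes out at roughly 26%, because the denominator is the total number of promoted rows (4668) rather than the department's size in train.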

Upvotes: 2

spaceman

Reputation: 413

import pandas as pd
df = pd.read_csv("/home/spaceman/my_work/Most-Recent-Cohorts-Scorecard-Elements.csv")
df = df[['STABBR']]  # each value appears in the dataframe multiple times
# the value counts then look like this:
CA    717
TX    454
NY    454
FL    417
PA    382
OH    320
IL    280
MI    189
NC    189
.........
.........

# normalize=True returns the relative frequency by dividing all values by the sum of values
print(df['STABBR'].value_counts(normalize=True))
CA    0.099930
TX    0.063275
NY    0.063275
FL    0.058118
PA    0.053240
OH    0.044599
IL    0.039024
MI    0.026341
NC    0.026341
..............
..............

Upvotes: 0

smj

Reputation: 1284

How about this? The example uses a toy dataset, but the key idea is simply dividing one value count by the other.

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'department': list(range(10)) * 100,
    'is_promoted': np.random.randint(0, 2, size=1000)
})

# Slice out promoted data.

data_promoted = data[data['is_promoted'] == 1]

# Calculate share of each department that is present in data_promoted.

data_promoted['department'].value_counts().sort_index() / data['department'].value_counts().sort_index()

Gives:

0    0.50
1    0.52
2    0.45
3    0.54
4    0.41
5    0.50
6    0.45
7    0.52
8    0.60
9    0.52
Name: department, dtype: float64
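
As an alternative sketch (not part of the approach above), the same share can be obtained in one step with groupby, assuming is_promoted only contains 0 and 1: the mean of a 0/1 column within a group is the fraction of ones. The fixed seed and the variable names below are illustrative choices, so the numbers will differ from the output shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible
data = pd.DataFrame({
    "department": list(range(10)) * 100,
    "is_promoted": rng.integers(0, 2, size=1000),
})

# Mean of a 0/1 column per group = fraction of promoted rows per department.
rate_by_groupby = data.groupby("department")["is_promoted"].mean()

# Same quantity via the ratio of value counts, for comparison.
promoted = data[data["is_promoted"] == 1]
rate_by_counts = (
    promoted["department"].value_counts() / data["department"].value_counts()
)

# Multiply by 100 to express the share as a percentage.
print((rate_by_groupby * 100).round(1))
```

One difference worth noting: a department with no promoted rows shows up as 0 with the groupby version, whereas the value-count ratio yields NaN for it because the label is missing from the numerator.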

Upvotes: 1

Abhi

Reputation: 4233

Try:

promoted.department.value_counts()/train.department.value_counts()*100

It should give you the desired output:

Sales & Marketing    7.2030
Operations           9.0148
Technology          10.7593 
.....                 ...
Name: department, dtype: float64
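
For a check that runs without the original data, here is a small sketch that types in the two value_counts() results from the question as Series (train_counts and promoted_counts are names made up for this example). Dividing two Series aligns them on their index labels rather than on position, so it does not matter that Legal and R&D appear in a different order in the two outputs.

```python
import pandas as pd

# Counts copied by hand from the question's two value_counts() outputs.
train_counts = pd.Series({
    "Sales & Marketing": 16840, "Operations": 11348, "Technology": 7138,
    "Procurement": 7138, "Analytics": 5352, "Finance": 2536,
    "HR": 2418, "Legal": 1039, "R&D": 999,
}, name="department")

promoted_counts = pd.Series({
    "Sales & Marketing": 1213, "Operations": 1023, "Technology": 768,
    "Procurement": 688, "Analytics": 512, "Finance": 206,
    "HR": 136, "R&D": 69, "Legal": 53,
}, name="department")

# Division matches rows by department label, then scales to percent.
pct = (promoted_counts / train_counts * 100).round(2)
print(pct.sort_values(ascending=False))
```

With these numbers Sales & Marketing comes out at 7.2 and Technology at 10.76, matching the 1213/16840*100 calculation in the question.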

Upvotes: 2
