skad00sh

Reputation: 171

Convert whole dataset into percentages

I want to convert the following whole data set into percentages.

https://cocl.us/datascience_survey_data

To find the percentage, the sum of that row should be used.

e.g. for Big Data (Spark / Hadoop) = 1332 + 729 + 127 = 2188

So the percentage for Very interested will be 1332 / 2188 ≈ 60.88%.
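For concreteness, the single-row calculation above might look like this in pandas (the filename and column names are assumptions based on the linked data):

import pandas as pd

df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)  # filename assumed

row = df.loc['Big Data (Spark / Hadoop)']
total = row.sum()                                    # 1332 + 729 + 127 = 2188
very_interested_pct = row['Very interested'] / total * 100
print(round(very_interested_pct, 2))                 # -> 60.88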

I want to automate this for all rows. How can I do it?

Upvotes: 0

Views: 356

Answers (3)

jezrael

Reputation: 862511

You can divide all columns by the row sums with DataFrame.div and then multiply by 100:

df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)

df1 = df.div(df.sum(axis=1), axis=0).mul(100)
print (df1)
                            Very interested  Somewhat interested  \
Big Data (Spark / Hadoop)         60.877514            33.318099   
Data Analysis / Statistics        77.007299            20.255474   
Data Journalism                   20.235849            50.990566   
Data Visualization                61.580882            33.731618   
Deep Learning                     58.229599            35.500231   
Machine Learning                  74.724771            21.880734   

                            Not interested  
Big Data (Spark / Hadoop)         5.804388  
Data Analysis / Statistics        2.737226  
Data Journalism                  28.773585  
Data Visualization                4.687500  
Deep Learning                     6.270171  
Machine Learning                  3.394495  

Detail:

print (df.sum(axis=1))
Big Data (Spark / Hadoop)     2188
Data Analysis / Statistics    2192
Data Journalism               2120
Data Visualization            2176
Deep Learning                 2169
Machine Learning              2180
dtype: int64

A NumPy alternative is very similar:

df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)

arr = df.values
df1 = pd.DataFrame(arr / np.sum(arr, axis=1)[:, None] * 100,
                   index=df.index,
                   columns=df.columns)
print (df1)
                            Very interested  Somewhat interested  \
Big Data (Spark / Hadoop)         60.877514            33.318099   
Data Analysis / Statistics        77.007299            20.255474   
Data Journalism                   20.235849            50.990566   
Data Visualization                61.580882            33.731618   
Deep Learning                     58.229599            35.500231   
Machine Learning                  74.724771            21.880734   

                            Not interested  
Big Data (Spark / Hadoop)         5.804388  
Data Analysis / Statistics        2.737226  
Data Journalism                  28.773585  
Data Visualization                4.687500  
Deep Learning                     6.270171  
Machine Learning                  3.394495  
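If you need the result rounded to two decimals, as in the question's example, you could round it afterwards with DataFrame.round:

df1 = df1.round(2)
print (df1)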

Upvotes: 3

Amirali Madani

Reputation: 1

import pandas as pd

df = pd.read_csv('filename.csv')
df['very_interested_pct'] = (df['Very interested'] /
                             (df['Somewhat interested'] + df['Very interested'] + df['Not interested'])) * 100

This will create a new column called very_interested_pct. You could do the same for the other two columns and then drop the original columns.
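As a sketch of that idea (column names taken from the data above, new column names assumed), you could loop over the three columns, add a percentage column for each, and then drop the original counts:

cols = ['Very interested', 'Somewhat interested', 'Not interested']
total = df[cols].sum(axis=1)

for col in cols:
    # e.g. 'Very interested' -> 'very_interested_pct'
    df[col.lower().replace(' ', '_') + '_pct'] = df[col] / total * 100

# drop the original count columns
df = df.drop(columns=cols)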

Upvotes: 0

1__

Reputation: 1571

The fastest option is to go with NumPy. No matter how big the data is, the calculation will be fast:

import numpy as np
import pandas as pd

# read the survey data (filename assumed)
data = pd.read_csv('Topic_Survey_Assignment.csv')

# get the values
values = data[['Very interested', 'Somewhat interested', 'Not interested']].values
# get the sum of each row
sums = values.sum(axis=1)
# reshape the sums into a column vector for the purposes of division
sums = np.reshape(sums, (-1, 1))
# divide each value by the row sum and multiply by 100
percentages = (values / sums) * 100
# assign the calculation back to the original data
data[['Very interested', 'Somewhat interested', 'Not interested']] = percentages
# print the data
print(data)
                   Unnamed: 0  Very interested  Somewhat interested  Not interested
0   Big Data (Spark / Hadoop)        60.877514            33.318099        5.804388
1  Data Analysis / Statistics        77.007299            20.255474        2.737226
2             Data Journalism        20.235849            50.990566       28.773585
3          Data Visualization        61.580882            33.731618        4.687500
4               Deep Learning        58.229599            35.500231        6.270171
5            Machine Learning        74.724771            21.880734        3.394495
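If you want to verify the speed claim yourself, a rough timeit comparison on a larger random frame might look like this (setup assumed; timings vary by machine and data size):

import numpy as np
import pandas as pd
from timeit import timeit

# random frame with 100,000 rows, purely for timing purposes
big = pd.DataFrame(np.random.randint(1, 1000, size=(100_000, 3)),
                   columns=['Very interested', 'Somewhat interested', 'Not interested'])

pandas_way = lambda: big.div(big.sum(axis=1), axis=0).mul(100)
numpy_way = lambda: big.values / big.values.sum(axis=1)[:, None] * 100

print(timeit(pandas_way, number=100))   # DataFrame.div approach
print(timeit(numpy_way, number=100))    # raw NumPy approach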

Upvotes: 1
