Reputation: 171
I want to convert the following whole data set into percentages.
https://cocl.us/datascience_survey_data
To find the percentage, the sum of that row should be used,
e.g. for Big Data (Spark / Hadoop): 1332 + 729 + 127 = 2188,
so the percentage for Very interested will be 1332 / 2188 ≈ 60.88%.
I want to automate this for all rows. How can I do it?
Upvotes: 0
Views: 356
Reputation: 862511
You can divide all columns by the row sums with DataFrame.div and then multiply by 100:
import pandas as pd

df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)
# divide each row by its row sum, then scale to percentages
df1 = df.div(df.sum(axis=1), axis=0).mul(100)
print(df1)
Very interested Somewhat interested \
Big Data (Spark / Hadoop) 60.877514 33.318099
Data Analysis / Statistics 77.007299 20.255474
Data Journalism 20.235849 50.990566
Data Visualization 61.580882 33.731618
Deep Learning 58.229599 35.500231
Machine Learning 74.724771 21.880734
Not interested
Big Data (Spark / Hadoop) 5.804388
Data Analysis / Statistics 2.737226
Data Journalism 28.773585
Data Visualization 4.687500
Deep Learning 6.270171
Machine Learning 3.394495
Detail:
print (df.sum(axis=1))
Big Data (Spark / Hadoop) 2188
Data Analysis / Statistics 2192
Data Journalism 2120
Data Visualization 2176
Deep Learning 2169
Machine Learning 2180
dtype: int64
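These row sums match the manual total in the question (2188 for Big Data). As a quick sanity check against the question's arithmetic, assuming the index labels match the printed output:

# verify one row against the question's manual calculation
row = df.loc['Big Data (Spark / Hadoop)']
print(row.sum())                                 # 2188
print(row['Very interested'] / row.sum() * 100)  # 60.877...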
A NumPy alternative is very similar:
import numpy as np
import pandas as pd

df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)
# extract the underlying NumPy array
arr = df.values
# divide each row by its row sum (broadcast via the added axis) and scale to percentages
df1 = pd.DataFrame(arr / np.sum(arr, axis=1)[:, None] * 100,
                   index=df.index,
                   columns=df.columns)
print(df1)
Very interested Somewhat interested \
Big Data (Spark / Hadoop) 60.877514 33.318099
Data Analysis / Statistics 77.007299 20.255474
Data Journalism 20.235849 50.990566
Data Visualization 61.580882 33.731618
Deep Learning 58.229599 35.500231
Machine Learning 74.724771 21.880734
Not interested
Big Data (Spark / Hadoop) 5.804388
Data Analysis / Statistics 2.737226
Data Journalism 28.773585
Data Visualization 4.687500
Deep Learning 6.270171
Machine Learning 3.394495
Upvotes: 3
Reputation: 1
import pandas as pd

df = pd.read_csv('filename.csv')
# percentage of 'Very interested' relative to the row total
df['very_interested_pct'] = (df['Very interested'] /
                             (df['Very interested'] + df['Somewhat interested'] + df['Not interested'])) * 100
This creates a new column called very_interested_pct; you could do the same for the other two columns and then drop the original ones, as sketched below.
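A minimal sketch of that follow-up, assuming the column names from the question's data:

total = df['Very interested'] + df['Somewhat interested'] + df['Not interested']
df['very_interested_pct'] = df['Very interested'] / total * 100
df['somewhat_interested_pct'] = df['Somewhat interested'] / total * 100
df['not_interested_pct'] = df['Not interested'] / total * 100
# drop the original count columns once the percentage columns exist
df = df.drop(columns=['Very interested', 'Somewhat interested', 'Not interested'])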
Upvotes: 0
Reputation: 1571
A fast option is to do the arithmetic in NumPy; the vectorized calculation scales well even for large data:
import numpy as np
import pandas as pd

# load the survey data (filename as in the answer above)
data = pd.read_csv('Topic_Survey_Assignment.csv')

# get the values of the three answer columns as a 2-D array
values = data[['Very interested', 'Somewhat interested', 'Not interested']].values
# get the sum of each row
sums = values.sum(axis=1)
# reshape the sums into a column vector so the division broadcasts per row
sums = np.reshape(sums, (-1, 1))
# divide each value by its row sum and multiply by 100
percentages = (values / sums) * 100
# assign the calculation back to the original data
data[['Very interested', 'Somewhat interested', 'Not interested']] = percentages
# print the data
print(data)
Unnamed: 0 Very interested Somewhat interested Not interested
0 Big Data (Spark / Hadoop) 60.877514 33.318099 5.804388
1 Data Analysis / Statistics 77.007299 20.255474 2.737226
2 Data Journalism 20.235849 50.990566 28.773585
3 Data Visualization 61.580882 33.731618 4.687500
4 Deep Learning 58.229599 35.500231 6.270171
5 Machine Learning 74.724771 21.880734 3.394495
Upvotes: 1