Reputation: 87
I have the following dataframe and would like to calculate percentages according to Stage and Category. Some of my other data have another variable e.g. Year. I need the output on a dataframe to be able to use ggplot2.
Gender = rep(c("Female", "Male"), 6)
Stage = rep(c("Applied", "Appointed", "Interviewed"), each=2, times = 2)
Category = rep(c("Professional", "Research"), each = 6)
Count = as.integer(c("346", "251", "22", "15", "60", "52", "31", "230", "4", "17", "9", "52"))
df = data.frame(Gender, Stage, Category,Count )
The (horrible) code I wrote works for some instance but if the data structure changes e.g. Females with 0 count, the code will not work.
totals = aggregate(df$Count, by = list(Stage = df$Stage, Category = df$Category),sum)
totals = rep( totals$x, each = 2)
df$Percentage = round(df$Count/totals, 2)
This is the output I am after:
Gender Stage Category Count Percentage
1 Female Applied Professional 346 0.58
2 Male Applied Professional 251 0.42
3 Female Appointed Professional 22 0.59
4 Male Appointed Professional 15 0.41
5 Female Interviewed Professional 60 0.54
6 Male Interviewed Professional 52 0.46
7 Female Applied Research 31 0.12
8 Male Applied Research 230 0.88
9 Female Appointed Research 4 0.19
10 Male Appointed Research 17 0.81
11 Female Interviewed Research 9 0.15
12 Male Interviewed Research 52 0.85
Thanks for your help!
Upvotes: 0
Views: 761
Reputation: 1798
I suggest using data.table package. There you would be able to write something like:
library(data.table)
dt[,Percentage := round(Count / sum(Count), 2), by=c("Stage", "Category")]
The reason I suggest using data.table packages is that it is one of the fastest packages for data.frames. In general standard data frames are quite bad.
In comparison to dplyr, data.table is faster, but doesn't have transparent interface to SQL data bases.
The speed in data.table is mostly achieved by zero copying in data transformation.
Here is the manual
Upvotes: 1
Reputation: 10761
We can use the ave
function:
df$Percentage <- df$Count / ave(df$Count, df$Stage, df$Category, FUN = sum)
Gender Stage Category Count Percentage
1 Female Applied Professional 346 0.5795645
2 Male Applied Professional 251 0.4204355
3 Female Appointed Professional 22 0.5945946
4 Male Appointed Professional 15 0.4054054
5 Female Interviewed Professional 60 0.5357143
6 Male Interviewed Professional 52 0.4642857
7 Female Applied Research 31 0.1187739
8 Male Applied Research 230 0.8812261
9 Female Appointed Research 4 0.1904762
10 Male Appointed Research 17 0.8095238
11 Female Interviewed Research 9 0.1475410
12 Male Interviewed Research 52 0.8524590
Upvotes: 3
Reputation: 886928
We can use dplyr
library(dplyr)
df %>%
group_by(Stage, Category) %>%
mutate(Percentage = round(Count/sum(Count), 2))
Upvotes: 2