Calculate Percentage in a dataframe according to groups R

I have the following dataframe and would like to calculate percentages according to Stage and Category. Some of my other data have another variable e.g. Year. I need the output on a dataframe to be able to use ggplot2.

Gender = rep(c("Female", "Male"), 6)
Stage = rep(c("Applied", "Appointed", "Interviewed"), each=2, times = 2)
Category = rep(c("Professional", "Research"), each = 6)
Count = as.integer(c("346", "251", "22", "15", "60", "52", "31", "230", "4", "17", "9", "52"))
df = data.frame(Gender, Stage, Category,Count )

The (horrible) code I wrote works for some instance but if the data structure changes e.g. Females with 0 count, the code will not work.

totals = aggregate(df$Count, by = list(Stage = df$Stage, Category = df$Category),sum)
totals = rep( totals$x, each = 2)
df$Percentage = round(df$Count/totals, 2)

This is the output I am after:

   Gender       Stage     Category Count Percentage
1  Female     Applied Professional   346       0.58
2    Male     Applied Professional   251       0.42
3  Female   Appointed Professional    22       0.59
4    Male   Appointed Professional    15       0.41
5  Female Interviewed Professional    60       0.54
6    Male Interviewed Professional    52       0.46
7  Female     Applied     Research    31       0.12
8    Male     Applied     Research   230       0.88
9  Female   Appointed     Research     4       0.19
10   Male   Appointed     Research    17       0.81
11 Female Interviewed     Research     9       0.15
12   Male Interviewed     Research    52       0.85

Thanks for your help!

Upvotes: 0

Views: 761

Answers (3)

mcsim
mcsim

Reputation: 1798

I suggest using data.table package. There you would be able to write something like:

library(data.table)
dt[,Percentage := round(Count / sum(Count), 2), by=c("Stage", "Category")]

The reason I suggest using data.table packages is that it is one of the fastest packages for data.frames. In general standard data frames are quite bad.

In comparison to dplyr, data.table is faster, but doesn't have transparent interface to SQL data bases.

The speed in data.table is mostly achieved by zero copying in data transformation.

Here is the manual

Upvotes: 1

bouncyball
bouncyball

Reputation: 10761

We can use the ave function:

df$Percentage <- df$Count / ave(df$Count, df$Stage, df$Category, FUN = sum)

   Gender       Stage     Category Count Percentage
1  Female     Applied Professional   346  0.5795645
2    Male     Applied Professional   251  0.4204355
3  Female   Appointed Professional    22  0.5945946
4    Male   Appointed Professional    15  0.4054054
5  Female Interviewed Professional    60  0.5357143
6    Male Interviewed Professional    52  0.4642857
7  Female     Applied     Research    31  0.1187739
8    Male     Applied     Research   230  0.8812261
9  Female   Appointed     Research     4  0.1904762
10   Male   Appointed     Research    17  0.8095238
11 Female Interviewed     Research     9  0.1475410
12   Male Interviewed     Research    52  0.8524590

Upvotes: 3

akrun
akrun

Reputation: 886928

We can use dplyr

library(dplyr)
df %>% 
   group_by(Stage, Category) %>%
   mutate(Percentage = round(Count/sum(Count), 2))

Upvotes: 2

Related Questions