How to calculate the percentage in different rows of one column?

Question

I'm trying to calculate each value as a percentage of the total for its occupation and year. As an example, using df below, the first row's percentage would be:

665 / (665 + 709) * 100 = 48.4

I was able to use aggregate to calculate the mean, but am stuck on how to calculate the percentages:

aggregate(x = df$value, by = list(df$occupation, df$year), FUN = mean)

df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = rep(rep(c(0, 1), each = 4), 2),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346, 808, 959, 651, 468, 756, 832, 1140, 431)
)

Adam Bethke · Accepted Answer

I think the answer you are looking for is:

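aggregate(
  x = df$value,
  by = list(occupation = df$occupation, year = df$year),
  # Each group holds one value per sex; express each as a percent of the group total
  FUN = function(x) {
    round(x / sum(x) * 100, 1)
  }
)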

The crux of the answer is the FUN argument: aggregate applies whatever function you supply to the values in each group. Since R has a built-in mean function, you were able to pass mean to FUN directly. There is no built-in for "percentage of the group total", so you supply your own anonymous function instead. The functional programming chapter of Hadley Wickham's Advanced R has a lot more detail on building named and anonymous functions.
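To make the point about FUN concrete, here is a minimal sketch that passes a named function rather than an anonymous one. The names pct and res are made up for this example, and df is rebuilt from the question so the snippet runs on its own.

```r
# Data from the question
df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = rep(rep(c(0, 1), each = 4), 2),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346,
            808, 959, 651, 468, 756, 832, 1140, 431)
)

# Illustrative helper: each value as a percentage of its group's total
pct <- function(x) round(x / sum(x) * 100, 1)

# A named function is passed to FUN the same way mean would be
res <- aggregate(
  x = df$value,
  by = list(occupation = df$occupation, year = df$year),
  FUN = pct
)

# Because pct returns two numbers per group, res$x is a matrix
# with one column per value in the group (here, one per sex)
res$x[1, ]
```

The first row of res$x corresponds to occupation 1 in 2003, i.e. the 665 and 709 from the question.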

That said, for data manipulation tasks like this, packages like dplyr really excel at making the task less complex and easier to read. You could use the aggregate answer above, but unless you have a reason to (e.g. building a package and you want to avoid dependencies), the additional package can make your code more readable and maintainable.

library(dplyr)

output <-
  df %>%
  group_by(year, occupation) %>%
  # Percent of each row's value within its year/occupation group
  mutate(percent = round(value / sum(value) * 100, 1)) %>%
  ungroup()

The other benefit of this approach is that it adds a column to your original data frame. aggregate instead returns a separate summary whose result column is a matrix, which is usable but not pretty by default.
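If you want that same "add a column to the original data" result without taking on a dependency, base R's ave() is one option. This is only a sketch of an alternative, not part of the approaches above; group_total is an illustrative name, and df is rebuilt from the question so the snippet runs on its own.

```r
# Data from the question
df <- data.frame(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = rep(rep(c(0, 1), each = 4), 2),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346,
            808, 959, 651, 468, 756, 832, 1140, 431)
)

# ave() applies FUN within each year/occupation group and returns
# a vector the same length as the input, aligned with the original rows
group_total <- ave(df$value, df$year, df$occupation, FUN = sum)

# Each row's value as a percentage of its group total
df$percent <- round(df$value / group_total * 100, 1)

head(df, 4)
```

Note that FUN has to be passed by name here, since ave() treats any unnamed arguments after the first as grouping variables.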

This vignette has a bunch of great examples of these types of data manipulation tasks. The dplyr/tidyr cheatsheet is also helpful for these kinds of tasks.

My answer relies on dplyr because it is my go-to tool; there are definitely others (plyr, data.table) which may be better suited to a given task. I still like dplyr for this problem, but I mention other options because it is always worth thinking about the best tool for the job.
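For comparison, here is roughly how the same calculation might look in data.table. Treat it as a sketch: it assumes the data.table package is installed, dt is an illustrative name, and the data is rebuilt from the question so the snippet runs on its own.

```r
library(data.table)

# Data from the question, built as a data.table
dt <- data.table(
  year = c(rep(2003, 8), rep(2005, 8)),
  sex = rep(rep(c(0, 1), each = 4), 2),
  occupation = rep(1:4, 4),
  value = c(665, 661, 695, 450, 709, 460, 1033, 346,
            808, 959, 651, 468, 756, 832, 1140, 431)
)

# := adds the percent column by reference; by = defines the groups
dt[, percent := round(value / sum(value) * 100, 1), by = .(year, occupation)]

head(dt, 4)
```

Like the mutate version, this keeps every original row and adds the percentage alongside it.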
