Reputation: 17418

calculate percentages of levels - add column name

This code:

tips <- data.frame(
        gender = c("female", "male", "male")
        ,smoker = c("yes", "no", "no")
     )

tblFun <- function(x) {
    tbl <- table(x)
    res <- cbind(tbl, round(prop.table(tbl) * 100, 2))
    colnames(res) <- c('Count', 'Percentage')
    res
}

do.call(rbind, lapply(tips[1:2], tblFun))

produces this:

       Count Percentage
female     1      33.33
male       2      66.67
no         2      66.67
yes        1      33.33

which is great. However, I would like to produce this:

key_value_pair       Count Percentage
gender=female     1      33.33
gender=male       2      66.67
smoker=no         2      66.67
smoker=yes        1      33.33

Could someone please be so kind and suggest a solution? Thanks!

Upvotes: 0

Answers (2)

r2evans

Reputation: 160607

One way is to expand your tblFun function to accept the category name and prepend it to the labels.

tblFun <- function(x, nm = character(0)) {
  tbl <- table(x)
  if (length(nm)) names(tbl) <- paste(nm[[1]], names(tbl), sep = "=")
  res <- cbind(tbl, round(prop.table(tbl) * 100, 2))
  colnames(res) <- c('Count', 'Percentage')
  res
}

Given no change, it behaves as before:

do.call(rbind, lapply(tips[1:2], tblFun))
#        Count Percentage
# female     1      33.33
# male       2      66.67
# no         2      66.67
# yes        1      33.33

In order to pass each column's name with each column, you need to use the multi-argument version of lapply, Map:

do.call(rbind, Map(tblFun, tips[1:2], names(tips[1:2])))
#               Count Percentage
# gender=female     1      33.33
# gender=male       2      66.67
# smoker=no         2      66.67
# smoker=yes        1      33.33

An alternative is to use purrr::imap, which passes both the object and its name (as a second argument) to the function:

do.call(rbind, purrr::imap(tips[1:2], tblFun))
#               Count Percentage
# gender=female     1      33.33
# gender=male       2      66.67
# smoker=no         2      66.67
# smoker=yes        1      33.33

One advantage I see is no need to include both tips[1:2] and names(tips[1:2]), though if you aren't already using purrr or tidyverse-packages, then adding another package just for this might not be desired (esp when Map does the same thing with the explicit names()).

As a brief demonstration of what Map is doing: it "zips" arguments together.

As a comparison, lapply (and family) run a function once for each element of its input vector/list. So lapply(1:3, myfunc) "unrolls" to

list(
  myfunc(1),
  myfunc(2),
  myfunc(3)
)

If you try to provide multiple vectors, however, it does not perform as one "might" want/think: lapply(1:3, myfunc, 11:13) unrolls to:

list(
  myfunc(1, 11:13),
  myfunc(2, 11:13),
  myfunc(3, 11:13)
)

Map does it for an arbitrary number of vectors/lists, so Map(myfunc, 1:3, 11:13, 21:23, 99) unrolls to

list(
  myfunc(1, 11, 21, 99),
  myfunc(2, 12, 22, 99),
  myfunc(3, 13, 23, 99)
)

(Notice how length-one vectors are recycled. Though it does do recycling of lengths between 1 and the length of the longest vector, I don't recommend relying on it unless you strictly control the fact that the shorter vectors should multiply out to the length of the longest with no remainder.)

myfunc in this case must accept (at least) three arguments. Two notable differences between lapply and Map:

lapply puts data-first, function-second; because Map accepts one or more vectors/lists, it puts the function-first, one-or-more data second+;
Map will recycle a singular list argument, so Map(myfunc, 1:3, list(11:13) unrolls to list(myfunc(1, 11:13), myfunc(2, 11:13), myfunc(3, 11:13)), which on the surface looks very similar to lapply(1:3, myfunc, 11:13) but can be handy when you have more than two vectors of inputs.
Because it doesn't care how many vectors/lists, one can send arbitrary/unknown lengths of vectors/lists to Map with do.call, as in
```
l <- list(1:3, 11:13, 21:21)
do.call("Map", c(f = myfunc, l))
```
(as long as myfunc accepts arbitrary number of arguments, likely via ... mechanics). The first and only named argument of Map is f= for the function; it isn't required to name it here, but I like to for clarity.
Just like lapply has the optionally-simplifying version sapply, Map has the optionally-simplifying version mapply. I tend to prefer explicit -- there's nothing more frustrating than expecting vector output but one input causes the output to be a list ...

Upvotes: 2

bouncyball

Reputation: 10771

I would use the tidyverse and some data manipulation:

library(tidyverse)

tips %>%
    gather(key_value, value) %>% # wide to long
    count(key_value, value) %>%
    group_by(key_value) %>%
    mutate(percentage = n / sum(n)) %>%
    unite(key_value_pair, key_value, value, sep = "=") # convert 2 cols into 1

#   key_value_pair     n percentage
#   <chr>          <int>      <dbl>
# 1 gender=female      1      0.333
# 2 gender=male        2      0.667
# 3 smoker=no          2      0.667
# 4 smoker=yes         1      0.333

Upvotes: 3

calculate percentages of levels - add column name

Answers (2)

Related Questions