get count with stat_frequency

Question

I have a routine to create some plots with ggplot:

getPlotList = function(param.list, data=db, y, color){
  param.list %>% sapply(function(var){
    ggplot(data=data, aes(x=data[[var]], y=data[[y]], color=data[[color]]))+
      stat_summary(fun.y = mean, fun.ymin = function(x){mean(x) - sem(x)}, fun.ymax = function(x){mean(x) + sem(x)}, geom = "errorbar", width=.1, position = position_dodge(0.3), na.rm = TRUE) +
      stat_summary(fun.y = mean, geom = "point", position = position_dodge(0.3), na.rm = TRUE) +
      ylim(0, NA)
  }, simplify = FALSE, USE.NAMES = TRUE)
}

Which I use like this:

c("col1", "col2", "col3") %>% getPlotList(y="col4", color="col5")

This works perfectly (I have dozens of plots to write) and gives a result like this (but without the n=... labels):

The thing is, my count is the same for every color, but it can change with x.
Since there are errorbars (which won't show if n=1 or n=0), I have to show the count in labels, like I did on the picture (with Paint).

There are a lot of similar questions on SO (like this one, this one, this one, etc.), but they all use geom_histogram or geom_bar, which happen to have the ..count.. computed variable available, unlike the stat_summary I'm using.

How could I add those labels?

PS: I tried to use quosures instead of data[[...]] in my function but failed miserably. This is not the main part of the question, but if anybody has an idea, it would help me a lot.

Mark Peterson · Accepted Answer

This is built using these sample data:

sampleData <-
  data.frame(
    col1 = factor(rep(LETTERS[1:4], c(12, 6, 16, 20)*5)
                  , levels = LETTERS[1:4])
    , col2 = factor(rep(LETTERS[1:4], c(1, 17, 16, 20)*5)
                    , levels = LETTERS[1:4])
    , col3 = factor(rep(LETTERS[1:4], c(0, 18, 16, 20)*5)
                    , levels = LETTERS[1:4])
    , col4 = rnorm(54*5, 4, 2)
    , col5 = factor(rep(1:5, 54))
  )

The basic approach is to add the label yourself. For that, I used table to count the occurrences of each X/color combination and generated a new data.frame to display those counts. Note that, while you say that each color within an X grouping always has the same sample size, it is better to program defensively. Instead of trusting that (and, e.g., using the counts for the first color), I use apply to get all of the unique counts in each X group. As long as there is only one, the effect is the same. However, if there is more than one, the label lists all of them separated by ";", which flags the discrepancy.
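To make the counting step concrete, here is a minimal sketch on a toy data frame. The names toy, grp and col are made up for illustration; group B is deliberately unbalanced so that the ";"-separated label appears.

```r
# Toy data (hypothetical): group A has 2 rows per colour,
# group B has 1 row for colour 1 and 3 rows for colour 2
toy <- data.frame(
  grp = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  col = factor(c("1", "1", "2", "2", "1", "2", "2", "2"))
)

# Rows are the x groups, columns are the colours
counts <- table(toy$grp, toy$col)

# Collapse each row to its unique counts; more than one value signals imbalance
labels <- paste("n =", apply(counts, 1, function(x) paste(unique(x), collapse = ";")))
names(labels) <- row.names(counts)
labels
```

Group A collapses to a single count, while group B keeps both of its counts in the label, so an unexpected imbalance is visible directly on the plot.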

In addition, I went ahead and switched the mapping to use aes_string so that the axis and legend titles are filled in from your column names. If you don't like that behavior, just override them with ylab etc.

Similarly, the function sem was not found (I assume it is a custom function), so I used the mean_cl_normal function instead (it needs the Hmisc package installed), which has the added advantage of going through the fun.data argument for cleaner code. (I also prefer confidence intervals to just showing the SEM, but that is more style than substance.)
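For reference, a standard-error helper like the missing sem might look like the sketch below. This is only an assumption about what the asker's function does (standard deviation divided by the square root of the sample size), not their actual code.

```r
# Assumed definition: standard error of the mean, ignoring NAs
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

sem(c(2, 4, 6))  # sd is 2 and n is 3, so this is 2 / sqrt(3)
sem(5)           # NA, because sd() is undefined for a single value
```

With a definition like this, a group with n = 1 yields NA bounds, which would explain why the error bars disappear for such groups and why the count labels are worth showing.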

getPlotList = function(param.list, data=db, y, color){
  param.list %>% sapply(function(var){

    myCounts <- table(data[[var]], data[[color]])

    forLabels <-
      data.frame(
        x = row.names(myCounts)
        , label = paste("n =", apply(myCounts, 1, function(x){paste(unique(x), collapse = ";")}))
        , y = 0.5
      )

    ggplot(data=data, aes_string(x=var, y=y, color=color))+
      stat_summary(fun.data = mean_cl_normal, position = position_dodge(0.3), na.rm = TRUE) +
      stat_summary(fun.y = mean, geom = "point", position = position_dodge(0.3), na.rm = TRUE) +
      ylim(0, NA) +
      geom_text(aes(x = x, y = y, label = label, color = NA)
                , forLabels
                , show.legend = FALSE)
  }, simplify = FALSE, USE.NAMES = TRUE)
}

Now, this code:

c("col1", "col2", "col3") %>% getPlotList(y="col4", color="col5", data = sampleData)

gives the following plots:

At the request of @Nettle, I modified the code to use a bit more of the tidyverse, specifically using Standard Evaluation to loop through the column list instead of using the base table approach from before. I believe that the code should function identically. The main advantage is removing the intermediate variables, though one could argue that those improve readability.

getPlotList <- function(param.list, data=db, y, color){
  param.list %>% sapply(function(var){

    ggplot(data=data, aes_string(x=var, y=y, color=color))+
      stat_summary(fun.data = mean_cl_normal, position = position_dodge(0.3), na.rm = TRUE) +
      stat_summary(fun.y = mean, geom = "point", position = position_dodge(0.3), na.rm = TRUE) +
      ylim(0, NA) +
      geom_text(aes_string(x = var, y = "y", label = "label", color = NA)
                , data %>%
                  count(!!as.name(var), !!as.name(color)) %>%
                  group_by(!!as.name(var)) %>%
                  summarise(
                    label = paste("n =", paste(unique(n), collapse = ";"))
                  ) %>%
                  mutate(y = 0.5)
                , show.legend = FALSE)

  }, simplify = FALSE, USE.NAMES = TRUE)
}
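On the PS about avoiding data[[...]]: in more recent ggplot2 releases, aes_string is deprecated and the .data pronoun can be used inside aes instead, and the fun.y family of arguments has been renamed to fun, fun.min and fun.max. The following is a sketch of the same idea in that style, not a drop-in replacement. The helper names count_labels and plot_with_n and the demo data are made up for illustration, and mean_se (from ggplot2) stands in for the asker's sem-based error bars, which is an assumption.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper: one "n = ..." label per level of `var`.
# Several counts joined by ";" flag unequal counts across colours.
count_labels <- function(data, var, color) {
  data %>%
    count(!!as.name(var), !!as.name(color)) %>%
    group_by(!!as.name(var)) %>%
    summarise(label = paste("n =", paste(unique(n), collapse = ";"))) %>%
    mutate(y = 0.5)
}

# Hypothetical helper: the plot for a single x column
plot_with_n <- function(data, var, y, color) {
  ggplot(data, aes(x = .data[[var]], y = .data[[y]], colour = .data[[color]])) +
    # mean_se() gives mean +/- one standard error (assumed equivalent to sem)
    stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1,
                 position = position_dodge(0.3), na.rm = TRUE) +
    # `fun` replaces the older `fun.y` argument
    stat_summary(fun = mean, geom = "point",
                 position = position_dodge(0.3), na.rm = TRUE) +
    ylim(0, NA) +
    # inherit.aes = FALSE keeps the labels out of the colour mapping
    geom_text(aes(x = .data[[var]], y = y, label = label),
              data = count_labels(data, var, color),
              inherit.aes = FALSE) +
    labs(x = var, y = y, colour = color)
}

# Small demo data: 10 rows in group A, 20 in group B, 5 colours
set.seed(1)
demo <- data.frame(
  col1 = factor(rep(c("A", "B"), c(10, 20))),
  col4 = rnorm(30, 4, 2),
  col5 = factor(rep(1:5, 6))
)

plots <- sapply("col1", function(v) plot_with_n(demo, v, "col4", "col5"),
                simplify = FALSE, USE.NAMES = TRUE)
```

Splitting the label computation into its own function makes it easy to inspect the counts before plotting, and inherit.aes = FALSE avoids having to map color to NA in the text layer.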
