Mehmet Yildirim

Reputation: 501

How to include number of observations in each quartile of boxplot using ggplot2 in R?

I am plotting a boxplot to see the distribution of a variable. I would also like to see the number of observations in each quartile. Is there a way to add these counts to the boxplot, alongside the quartile values?

The code below generates a boxplot labeled with the quartile values.

library(ggplot2)
library(ggrepel)  # provides the "label_repel" geom used below

df <- datasets::iris
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
  geom_boxplot(width = 0.1, position = "dodge", fill = "red") +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  # Note: ggplot2 >= 3.3.0 prefers fun = and after_stat(y) over fun.y and ..y..
  stat_summary(geom = "label_repel", fun.y = quantile, aes(label = ..y..),
               position = position_nudge(x = -0.1), size = 3) +
  ggtitle("") +
  xlab("") +
  ylab('Sepal.Length')

I would like the quartile values on the left-hand side of the plot and the number of observations on the right-hand side, if possible.

Upvotes: 1

Views: 657

Answers (2)

Mehmet Yildirim

Reputation: 501

@TobiO's answer is correct. However, my data was skewed and some cut points coincided (for example, the first and second cut points were the same), so I needed to take the unique values to count the observations in each quartile. Another point concerns the cut function, which excludes the lower bound of each interval: (low bound, high bound]. To include the starting point, I used the cut2 function from the Hmisc package. I also added a label_pos_extension vector to keep the labels from overlapping for quartiles whose cut points are very close to each other; geom_text_repel did not prevent the overlaps.

library(Hmisc)  # provides cut2()

quantile_counts2 <- function(x){
  q <- quantile(x)
  n_unique <- length(unique(q))            # fewer than 5 when cut points coincide
  label_pos_extension <- c(0, 3, 4, 0)     # manual offsets to keep labels apart
  if (n_unique < 5) {
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = c(0, diff(unique(q)) / 2 + q[seq_len(n_unique - 1)]) +
                       label_pos_extension[1:n_unique])
  } else {
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = diff(q) / 2 + q[1:4] + label_pos_extension)
  }
  return(df)
}
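
To illustrate the lower-bound issue described above, here is a minimal base R sketch using the iris data from the question. It is not the cut2 approach used in this answer; it only shows that the default cut() leaves the minimum unbinned, and that include.lowest = TRUE is one base R way to close the first interval. The variable names are made up for the example.

```r
# Base R only; uses the built-in iris data from the question
x <- datasets::iris$Sepal.Length
q <- quantile(x)

# Default cut(): intervals are (low, high], so values equal to min(x) get NA
default_bins <- cut(x, breaks = q)
n_dropped <- sum(is.na(default_bins))

# include.lowest = TRUE closes the first interval, so every value is binned
closed_bins <- cut(x, breaks = q, include.lowest = TRUE)
counts <- table(closed_bins)

print(n_dropped)
print(counts)
```

Note that cut() stops with an error when the breaks are not unique, which is the situation with tied quantiles mentioned above; passing unique(q) as the breaks would be one way around that, at the cost of fewer bins.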

PS: I tried to post my edited function as a comment, but it did not work.

Upvotes: 1

TobiO

Reputation: 1381

This is one possibility. I prefer to keep additional data in a separate data frame, because it gives me more control over what is calculated and how.

The counting is inspired by https://stackoverflow.com/a/54451575

quantile_counts <- function(x){
  # counts per quartile bin, plus the midpoint of each bin as the label position
  df <- data.frame(label = table(cut(x, quantile(x))),
                   label_pos = diff(quantile(x)) / 2 + quantile(x)[1:4])
  return(df)
}

df_quantile_counts=quantile_counts(df$Sepal.Length)

boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
  geom_boxplot(width=0.1, position = "dodge", fill = "red") +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
               position = position_nudge(x = -0.1), size = 3) +
  geom_text(data=df_quantile_counts,aes(x="",y=label_pos,label = label.Freq),
            position = position_nudge(x = +0.1), size = 3) +
  ggtitle("") +
  xlab("") +
  ylab('Sepal.Length')
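
The label.Freq column used in geom_text above comes from how data.frame() expands a table passed under the name label. A small base R sketch to check the resulting column names (the variable name counts_df is only for illustration):

```r
# Base R only; mirrors the data.frame(label = table(...)) call in quantile_counts
x <- datasets::iris$Sepal.Length
counts_df <- data.frame(label = table(cut(x, quantile(x))))

# The table becomes two columns, each prefixed with "label."
col_names <- names(counts_df)
print(col_names)
```

The first column holds the interval labels and the second holds the counts, which is why the plot maps label = label.Freq.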

HTH, Tobi

Upvotes: 3
