Sebastian Hesse

Reputation: 545

Label the extreme values in each group in ggplot

In a ggplot that combines geom_violin and geom_point, I would like to label the extreme values of each group, strictly per group. My current solution collects the extreme proteins across all groups and then labels each of those proteins in every group where it appears. Since my real data has 10 groups and 3 extremes per group, this makes the graph unreadable.

Data

df <- data.frame(group = c('XvsHD', 'XvsHD', 'XvsHD', 'XvsHD', 'YvsHD', 'YvsHD', 'YvsHD', 'YvsHD', 'ZvsHD', 'ZvsHD', 'ZvsHD', 'ZvsHD'),
protein = c('A', 'B', 'C', 'D', 'A', 'D', 'G', 'F', 'A', 'C', 'D', 'R'),
logFC = c(-1, 2 , 4, 5, 2, 6, -3, 2, 4, 6, 1, 2))

extremes <- df %>% group_by(group) %>% slice_max(order_by = logFC, n = 2, preserve = T)%>% pull(protein)

Plots

    df %>%
      ggplot(aes(x = group, y = logFC)) +
      geom_violin() +
      geom_point()

    df %>%
      ggplot(aes(x = group, y = logFC)) +
      geom_violin() +
      geom_point() +
      geom_label_repel(aes(label = ifelse(protein %in% extremes, as.character(protein), NA), hjust = 0, vjust = 0))

The goal is a plot in which the 2 most extreme values of each group are labeled with the 'protein' tag. It would be even better if this worked for the extremely low as well as the extremely high values per group.

Thank you very much!

Sebastian

Upvotes: 1

Views: 1318

Answers (2)

Ronak Shah

Reputation: 388982

You can create a label column that holds the protein name only where logFC is the minimum or maximum of its group, and an empty string otherwise.

library(dplyr)
library(ggplot2)
library(ggrepel)

df %>%
  group_by(group) %>%
  mutate(label = ifelse(logFC %in% range(logFC), protein, '')) %>%
  ggplot(aes(x= group, y = logFC, label = label)) +
  geom_violin() +   
  geom_point() +
  geom_label_repel(hjust=0, vjust=0)
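To check which rows end up with a label before plotting, the same mutate step can be run on its own. This is a minimal sketch on the question's sample data; the name `labelled` is only illustrative and not part of the answer above.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  group   = rep(c("XvsHD", "YvsHD", "ZvsHD"), each = 4),
  protein = c("A", "B", "C", "D", "A", "D", "G", "F", "A", "C", "D", "R"),
  logFC   = c(-1, 2, 4, 5, 2, 6, -3, 2, 4, 6, 1, 2),
  stringsAsFactors = FALSE
)

# Keep the protein name only for the per-group minimum and maximum
labelled <- df %>%
  group_by(group) %>%
  mutate(label = ifelse(logFC %in% range(logFC), protein, "")) %>%
  ungroup()

# Rows that will receive a label in the plot
labelled %>% filter(label != "")
```

Because the comparison happens inside group_by, a protein such as D is judged separately in each group rather than being labelled everywhere it occurs.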


To label the top and bottom n values instead, you can use the dense_rank function.

n <- 2

df %>%
  group_by(group) %>%
  mutate(min_rank = dense_rank(logFC), 
         max_rank = dense_rank(-logFC)) %>%
  mutate(label = ifelse(min_rank <= n | max_rank <= n, protein, '')) %>%
  ggplot(aes(x= group, y = logFC, label = label)) +
  geom_violin() +   
  geom_point() +
  geom_label_repel(hjust=0, vjust=0)
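One thing to be aware of: dense_rank gives tied values the same rank, so a condition like `min_rank <= n` can select more than n points when values repeat. Below is a minimal sketch on the YvsHD rows from the question, which contain two values of 2. The `first_rank` column using `row_number()` is shown only as one possible way to break ties by row order; it is an assumption for illustration, not part of the answer above.

```r
library(dplyr)

# One group from the question's data, which contains a tie (two values of 2)
yvshd <- data.frame(
  protein = c("A", "D", "G", "F"),
  logFC   = c(2, 6, -3, 2),
  stringsAsFactors = FALSE
)

ranked <- yvshd %>%
  mutate(
    min_rank   = dense_rank(logFC),   # ties share a rank: A and F both get 2
    max_rank   = dense_rank(-logFC),  # same from the top: A and F both get 2
    first_rank = row_number(logFC)    # ties broken by row order: A before F
  )

ranked
```

With n = 2, the dense_rank columns pick up A and F together, while a rank based on `row_number()` would pick exactly two rows per side.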

Upvotes: 2

Richard Telford

Reputation: 9923

This differs slightly from Ronak Shah's solution in that it uses rank to find the two most extreme values per group, where extreme is defined as distance from the group mean. The two labelled values need not be one high and one low: both could be high, or both low.

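library(dplyr)
library(ggplot2)
library(ggrepel)

df <- df %>% group_by(group) %>% 
  mutate(
    # deviation from the group mean (plain subtraction keeps this a plain vector)
    logFC_demean = logFC - mean(logFC),
    label = ifelse(rank(-abs(logFC_demean)) <= 2, protein, "")) %>%
  ungroup()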


ggplot(df, aes(x= group, y = logFC, label = label)) +
  geom_violin() +   
  geom_point() +
  geom_label_repel(hjust=0, vjust=0)
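The remark that both labelled values can fall on the same side of the mean is easiest to see on a skewed group. A small sketch with hypothetical values (the vector `x` is made up for illustration and is not from the question's data):

```r
# Hypothetical values, chosen so that the mean sits far from the two largest
x <- c(0, 0, 0, 5, 6)

# Distance from the mean, ranked so that the largest distance gets rank 1
deviation  <- x - mean(x)
is_extreme <- rank(-abs(deviation)) <= 2

x[is_extreme]
```

Here the mean is 2.2, so the two values farthest from it are 5 and 6, both on the high side, and none of the zeros is selected.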

Upvotes: 1
