Reputation: 545
In a ggplot that combines geom_violin and geom_point, I would like to label the extreme values of each group, strictly on a per-group basis. My attempt so far collects the extreme proteins from all groups and then labels each of those proteins in every group where it occurs. My real data has 10 groups and 3 extremes per group, so this makes the graph unreadable.
library(dplyr)
library(ggplot2)
library(ggrepel)
df <- data.frame(group = c('XvsHD', 'XvsHD', 'XvsHD', 'XvsHD', 'YvsHD', 'YvsHD', 'YvsHD', 'YvsHD', 'ZvsHD', 'ZvsHD', 'ZvsHD', 'ZvsHD'),
                 protein = c('A', 'B', 'C', 'D', 'A', 'D', 'G', 'F', 'A', 'C', 'D', 'R'),
                 logFC = c(-1, 2, 4, 5, 2, 6, -3, 2, 4, 6, 1, 2))
# Top 2 logFC values per group, reduced to a plain vector of protein names
extremes <- df %>%
  group_by(group) %>%
  slice_max(order_by = logFC, n = 2) %>%
  pull(protein)
df %>%
  ggplot(aes(x = group, y = logFC)) +
  geom_violin() +
  geom_point()

df %>%
  ggplot(aes(x = group, y = logFC)) +
  geom_violin() +
  geom_point() +
  geom_label_repel(aes(label = ifelse(protein %in% extremes, as.character(protein), NA), hjust = 0, vjust = 0))
The goal is a plot in which the 2 most extreme values of each group are labelled with their 'protein' tag. Ideally this would work for the extremely low values as well as the extremely high values within each group.
Thank you very much!
Sebastian
Upvotes: 1
Views: 1318
Reputation: 388982
You can create a label column that holds the protein value only where logFC is the minimum or maximum value in its group.
library(dplyr)
library(ggplot2)
library(ggrepel)

df %>%
  group_by(group) %>%
  mutate(label = ifelse(logFC %in% range(logFC), protein, '')) %>%
  ggplot(aes(x = group, y = logFC, label = label)) +
  geom_violin() +
  geom_point() +
  geom_label_repel(hjust = 0, vjust = 0)
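To see which points this labels without drawing the plot, the same mutate step can be run on the question's sample data and filtered down to the non-empty labels. This is only a minimal sketch for checking the result; the name labelled is illustrative and not part of the answer above.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  group   = rep(c("XvsHD", "YvsHD", "ZvsHD"), each = 4),
  protein = c("A", "B", "C", "D", "A", "D", "G", "F", "A", "C", "D", "R"),
  logFC   = c(-1, 2, 4, 5, 2, 6, -3, 2, 4, 6, 1, 2),
  stringsAsFactors = FALSE
)

# Keep only the rows that would receive a label (per-group min and max)
labelled <- df %>%
  group_by(group) %>%
  mutate(label = ifelse(logFC %in% range(logFC), protein, "")) %>%
  ungroup() %>%
  filter(label != "")

labelled
```

On the sample data this leaves six rows, one minimum and one maximum per group.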
To label the top and bottom n values, you can use the dense_rank function.
n <- 2

df %>%
  group_by(group) %>%
  mutate(min_rank = dense_rank(logFC),
         max_rank = dense_rank(-logFC)) %>%
  mutate(label = ifelse(min_rank <= n | max_rank <= n, protein, '')) %>%
  ggplot(aes(x = group, y = logFC, label = label)) +
  geom_violin() +
  geom_point() +
  geom_label_repel(hjust = 0, vjust = 0)
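Note that dense_rank gives tied values the same rank, so a group can end up with more than n labels at either end. If exactly n labels per end are wanted, one possible alternative (a sketch, not part of the answer above) is to build a separate data frame of the rows to label with slice_min and slice_max and pass it to geom_label_repel through its data argument. The names n_label, to_label and p are illustrative. Because the sample data has only four points per group, labelling two per end would label every point, so the sketch uses one per end.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Sample data from the question
df <- data.frame(
  group   = rep(c("XvsHD", "YvsHD", "ZvsHD"), each = 4),
  protein = c("A", "B", "C", "D", "A", "D", "G", "F", "A", "C", "D", "R"),
  logFC   = c(-1, 2, 4, 5, 2, 6, -3, 2, 4, 6, 1, 2),
  stringsAsFactors = FALSE
)

n_label <- 1  # points to label at each end, per group

# Rows to label: the n_label highest and n_label lowest logFC per group.
# with_ties = FALSE keeps exactly n_label rows per end even when values tie.
to_label <- bind_rows(
  df %>% group_by(group) %>% slice_max(order_by = logFC, n = n_label, with_ties = FALSE),
  df %>% group_by(group) %>% slice_min(order_by = logFC, n = n_label, with_ties = FALSE)
) %>%
  ungroup() %>%
  distinct()  # drop rows picked at both ends in very small groups

# Only the rows in to_label get a label; all points are still drawn
p <- ggplot(df, aes(x = group, y = logFC)) +
  geom_violin() +
  geom_point() +
  geom_label_repel(data = to_label, aes(label = protein))
```

Printing p draws the plot. Keeping the labelled rows in their own data frame also makes it easy to inspect them before plotting.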
Upvotes: 2
Reputation: 9923
This is a slightly different solution from Ronak Shah's in that it uses rank to find the two most extreme values in each group, where "extreme" is defined as the largest absolute difference from the group mean. The two extremes need not be one high and one low: both could be high, or both low.
df <- df %>%
  group_by(group) %>%
  mutate(
    # Centre logFC on the group mean; as.numeric() drops the matrix wrapper that scale() returns
    logFC_demean = as.numeric(scale(logFC, scale = FALSE)),
    label = ifelse(rank(-abs(logFC_demean)) <= 2, protein, ""))

ggplot(df, aes(x = group, y = logFC, label = label)) +
  geom_violin() +
  geom_point() +
  geom_label_repel(hjust = 0, vjust = 0)
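One caveat with this approach: base R's rank uses ties.method = "average" by default, so when two points are equally far from the group mean they share a fractional rank, and the <= 2 test can label fewer than two points. The toy vector abs_dev below is made up purely to show the effect; it is not taken from the question's data.

```r
# Hypothetical absolute deviations from a group mean, with a tie for second place
abs_dev <- c(3, 2, 2, 0)

r_avg   <- rank(-abs_dev)                         # 1.0 2.5 2.5 4.0
r_min   <- rank(-abs_dev, ties.method = "min")    # 1 2 2 4
r_first <- rank(-abs_dev, ties.method = "first")  # 1 2 3 4

# Number of points that would pass the `<= 2` test under each method
sum(r_avg <= 2)    # 1
sum(r_min <= 2)    # 3
sum(r_first <= 2)  # 2
```

Which behaviour is preferable depends on the plot: "min" labels every tied point, while "first" always labels exactly two.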
Upvotes: 1