SKOR2

Reputation: 39

Highlight Points in Scatterplot w/ ggplot2

I need to create a QQ plot of -log10 p-values in ggplot2 in which a subset of 137 points ("targets") is highlighted in gold, using a colorblind-friendly palette I call cbbPalette. I cannot use another plotting package because I eventually need to combine multiple QQ plots into a grid with grid.arrange from the gridExtra package, which works with ggplot2.

Setup:

library(ggplot2)
library(reshape2)
cbbPalette <- c("#E69F00", "#000000") #part of my palette; gold & black
set.seed(100)

The data consists of 100,137 p-values, 137 of which are targets:

p_values = c(
  runif(100000, min = 0, max = 1),
  runif(132, min = 1e-7, max = 1),
  c(6e-20, 6e-19, 7e-9, 7.5e-9, 4e-8)
)

#labels for the p-values
names_letters <-
  do.call(paste0, replicate(2, sample(LETTERS, 100137, TRUE), FALSE))
names = paste0(names_letters, sprintf("%04d", sample(9999, 100137, TRUE)))
targets = names[100001:100137] #last 137 are targets

df = as.data.frame(p_values)
df$names = names
df <-
  df[sample(nrow(df)), ] #shuffles the df to place targets randomly w/in
df$Category = ifelse(df$names %in% targets, "Target", "Non-Target")

Appearance of Data:

head(df, 4)
       p_values  names   Category
89863 0.4821147 NZ3385 Non-Target
20209 0.3998835 SQ3793 Non-Target
29200 0.7893478 ZT5497 Non-Target
71623 0.3459360 QF5311 Non-Target

Melted df Using reshape2 with Observed (o) & Expected (e) -log10 p-values:

df.m = melt(df)
df.m$o = -log10(sort(df.m$value, decreasing = F))
df.m$e = -log10(1:nrow(df.m) / nrow(df.m))

Appearance of Melted df:

head(df.m,4)
   names   Category variable     value         o        e
1 NZ3385 Non-Target p_values 0.4821147 19.221849 5.000595
2 SQ3793 Non-Target p_values 0.3998835 18.221849 4.699565
3 ZT5497 Non-Target p_values 0.7893478  8.154902 4.523473
4 QF5311 Non-Target p_values 0.3459360  8.124939 4.398535

QQ-plot

df_qq = ggplot(df.m, aes(e, o)) +
  geom_point(aes(color = Category)) +
  scale_colour_manual(values = cbbPalette) +
  geom_abline(intercept = 0, slope = 1) +
  ylab("Observed -log[10](p)") +
  xlab("Theoretical -log[10](p)")

I then get a QQ plot in which my 137 targets are not highlighted.

[Figure: QQ plot with no highlighting of the 137 targets]

Upvotes: 1

Views: 1463

Answers (2)

camille

Reputation: 16832

If you want to avoid having to split your dataframe into two calls to geom_point, you can order the data by the Category column first, then pipe it into ggplot. For just these two category values, you could arrange pretty simply:

library(dplyr) # provides %>% and arrange()

df.m %>%
    arrange(Category) %>%
    ggplot(...)

which will put your data in alphabetical order, with Non-Target observations first and Target ones after. Points are drawn in row order, so the points in the Target category end up on top.
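As a minimal end-to-end sketch of this approach: the toy data frame below is a hypothetical stand-in for df.m, and the names toy, toy_sorted and pal are made up for illustration. It uses a named palette, as in the other answer, because with an unnamed palette scale_colour_manual matches colours to levels in order, so the alphabetically first level ("Non-Target") would take the first colour.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for df.m (hypothetical values, not the question's data)
set.seed(1)
toy <- data.frame(
  e = runif(1000),
  o = runif(1000),
  Category = sample(rep(c("Target", "Non-Target"), times = c(50, 950)))
)

# Named palette so the colours do not depend on level order
pal <- c(Target = "#E69F00", `Non-Target` = "#000000")

# Sort so that "Target" rows come last and are therefore drawn last (on top)
toy_sorted <- toy %>% arrange(Category)

p <- ggplot(toy_sorted, aes(e, o)) +
  geom_point(aes(color = Category)) +
  scale_colour_manual(values = pal)
```

Printing p should show the gold points sitting on top of the black ones rather than being buried under them.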

To have more control over the ordering, you can make Category a factor, and set the levels explicitly, then arrange by the factor order:

library(dplyr)   # %>%, mutate(), arrange(), desc()
library(forcats) # fct_relevel()

df.m %>%
    mutate(Category = as.factor(Category) %>% fct_relevel("Target")) %>%
    arrange(desc(Category)) %>%
    ggplot(...)

I'm using fct_relevel from the forcats package just because it's a really easy way to manipulate factor levels; you could order the levels with base R as well. fct_relevel puts the Target level first, so I arrange by Category in descending order, which again means Target gets drawn last.
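For the base R route, a rough sketch might look like the following. The toy data frame and the name toy_sorted are hypothetical. Unlike the forcats version, this one puts Target last in the level order and sorts ascending, which gives the same draw order.

```r
# Toy stand-in for df.m (hypothetical values)
toy <- data.frame(
  o = c(3.2, 0.1, 5.0, 0.7),
  Category = c("Target", "Non-Target", "Target", "Non-Target")
)

# Set the level order explicitly: levels listed first sort first
toy$Category <- factor(toy$Category, levels = c("Non-Target", "Target"))

# order() on a factor sorts by level position, so "Target" rows end up last
toy_sorted <- toy[order(toy$Category), ]
```

Because an unnamed palette is matched to levels in order, changing the level order can also change which colour each category gets; a named palette avoids that.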

Hope that makes sense!

Upvotes: 1

Marius

Reputation: 60060

You can draw the targets in a separate geom_point() call after the non-targets. Geoms are plotted in the order they are added, so the targets end up on top:

cbbPalette <- c(Target = "#E69F00", `Non-Target` = "#000000")
df_qq = ggplot(df.m, aes(e, o)) +
    geom_abline(intercept = 0, slope = 1) +
    geom_point(aes(color = Category), data = df.m[df.m$Category == "Non-Target", ]) +
    geom_point(aes(color = Category), data = df.m[df.m$Category == "Target", ]) +
    scale_colour_manual(values = cbbPalette) +
    ylab("Observed -log[10](p)") +
    xlab("Theoretical -log[10](p)")

I've also added names to your palette to make sure the right colours are attached to each category; otherwise this can get mixed up when you change the order of the geom_point() calls.
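One further point, separate from draw order: in the question's code, o is built from the sorted p-values while the rows of df.m stay in their shuffled order, so each row's Category no longer belongs to its o value (the head(df.m, 4) output shows the largest o values labelled Non-Target). A minimal sketch of one way to keep them aligned, assuming the intent is to highlight each target at its own p-value, is to sort whole rows before computing o and e. The toy data and the name df_sorted are illustrative only.

```r
# Toy data: three non-targets and two targets with very small p-values
df <- data.frame(
  p_values = c(0.50, 1e-9, 0.20, 0.90, 1e-12),
  Category = c("Non-Target", "Target", "Non-Target", "Non-Target", "Target")
)

# Sort whole rows by p-value so Category travels with its own observation
df_sorted <- df[order(df$p_values), ]
n <- nrow(df_sorted)

df_sorted$o <- -log10(df_sorted$p_values)  # observed, largest value first
df_sorted$e <- -log10(seq_len(n) / n)      # expected, same rank formula as the question
```

With the rows aligned this way, the smallest p-values carry the Target label, so they can be coloured correctly by either approach above.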


Upvotes: 1
