ChrisW

Reputation: 1295

violin_plot() with continuous axis for grouping variable?

The grouping variable for a geom_violin() plot in ggplot2 is expected to be discrete, for obvious reasons. However, my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:

library(tidyverse)
df <- tibble(x = sample(c(1, 2, 5), size = 1000, replace = TRUE),
             y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x = factor(x), y = y))

This works as you'd imagine: equally spaced violins labelled 1, 2 and 5 on the x axis, with their means at y = 1, 2 and 5 respectively. I want to overlay a continuous function such as y = x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale.

A solution would presumably spread the violins horizontally by their numeric x values, i.e. with three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve: overlaying a continuous function is the key issue. If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace the violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.

Upvotes: 1

Views: 1686

Answers (2)

Mirjam

Reputation: 166

The functionality to draw violin plots on a continuous scale is built directly into ggplot2.

The key is to keep the original continuous variable (instead of converting it to a factor) and to specify the grouping through the group aesthetic of geom_violin(). Here the groups are built with cut_width(), and the bin width can be adjusted through its width argument (the 1 in cut_width(x, 1)), depending on the data at hand.

library(tidyverse)

df <- tibble(x = sample(c(1, 2, 5), size = 1000, replace = TRUE),
             y = rnorm(1000, mean = x))

ggplot(df, aes(x=x, y=y)) +
  geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
  geom_smooth(method = 'lm')

[Image: Minimal working example]
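To see what the grouping in the example above does, it can help to inspect cut_width() on its own. This is only a small sketch, assuming the three distinct x values from the toy data; with a width of 1 each value should land in a bin of its own, which is why one violin is drawn per x value.

```r
library(ggplot2)

# The three distinct x values from the toy example
x <- c(1, 2, 5)

# cut_width() bins a numeric vector into intervals of the given width
# and returns a factor with one level per interval
g <- cut_width(x, width = 1)

# Show which interval each x value was assigned to
print(data.frame(x = x, group = g))

# Number of distinct groups actually used by the data
print(length(unique(g)))
```

If the x values were closer together than the chosen width, several of them would share a bin and be merged into one violin, so the width probably needs to be no larger than the smallest gap between x values.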

With this approach, any geom for continuous data can be combined with the violin plots. For example, we could easily replace the straight line with a loess curve and add a scatter plot of the points.

ggplot(df, aes(x=x, y=y)) +
  geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
  geom_smooth(method = 'loess') +
  geom_point()

[Image: Extended working example]
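Since the question asks for a known function such as y = x rather than a fitted curve, the same continuous-axis setup can presumably be combined with stat_function(). The following is a sketch under assumptions, not part of the original answer: the function f, the seed, and the use of group = x (instead of cut_width()) are illustrative choices.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
x <- sample(c(1, 2, 5), size = 1000, replace = TRUE)
df <- data.frame(x = x, y = rnorm(1000, mean = x))

# Assumed target function to overlay: y = x
f <- function(x) x

p <- ggplot(df, aes(x = x, y = y)) +
  # x stays numeric; one violin per distinct x value
  geom_violin(aes(group = x), scale = "width") +
  # draw f over the range of the continuous x axis
  stat_function(fun = f, colour = "red")

print(p)
```

Grouping by x directly works here because x only takes a few distinct values; for genuinely continuous data, binning with cut_width() as in the answer would be the more natural choice.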

More examples can be found on the ggplot2 help page for geom_violin() (see ?geom_violin).

Upvotes: 3

stefan

Reputation: 123963

Try this. As you already guessed, spreading the violins by their numeric values is the key to the solution. To this end I expand df to include every x value from min(x) to max(x) in steps of 1, and use scale_x_discrete(drop = FALSE) so that all levels are displayed, including those without data.

Note: Thanks @ChrisW for the more general example of my approach.

library(tidyverse)

set.seed(42) 

df <- tibble(x = sample(c(1, 2, 5), size = 1000, replace = TRUE),
             y = rnorm(1000, mean = x^2))  # group means follow y = x^2

# add the missing x values (3 and 4) as rows with y = NA
x.range <- seq(from = min(df$x), to = max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"

# whatever the desired continuous function is, evaluated at each x:
df.fit <- tibble(x = x.range, y = x^2) %>%
  mutate(x = factor(x))

ggplot() + 
  geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) + 
  geom_line(data=df.fit, aes(x, y, group=1), color = "red") + 
  scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).

Created on 2020-06-11 by the reprex package (v0.3.0)
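The padding step and the warning about removed rows can be illustrated in isolation. This is a minimal sketch with made-up observations (the values in obs are hypothetical, not taken from the answer): joining against the full range of x adds one row with y = NA for each missing x, and those rows are presumably what the violin layer drops with the "Removed 2 rows" warning.

```r
library(dplyr)

# Hypothetical observations at x = 1, 2 and 5 only
obs <- tibble(x = c(1, 1, 2, 5), y = c(0.9, 1.2, 4.1, 24.8))

# Every x value from 1 to 5, as in x.range above
all_x <- tibble(x = as.numeric(1:5))

# right_join() keeps every row of all_x; x values without
# observations (3 and 4) get y = NA
padded <- right_join(obs, all_x, by = "x")

print(padded)
print(filter(padded, is.na(y)))
```

Only the factor levels matter for reserving space on the discrete axis, so the NA rows serve as placeholders and carry no data of their own.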

Upvotes: 1
