Reputation: 535

Visualising two very different distributions in one plot

This is more of a "how would you do it" than a "how to do it" question.

I have two groups, "a" and "b". "a" consists of responses that are roughly normally distributed and bounded between 0 and 1. "b", however, is mostly 1s.

The process that generated the data is a questionnaire. People in group "a" made mistakes; people in group "b" mostly figured it out.

How do I visualise these data side by side? Boxplots are messed up because the median of one of the groups is basically 1. Violin plots mess up the widths.

Here is a reproducible example with the rough idea.

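library(tidyverse)

set.seed(1) # fixed seed so the example is reproducible

d = tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25), # group "a": roughly normal
    rnorm(5, mean = 0.5, sd = 0.25),   # group "b": a few spread-out responses
    rep(1, 95)                         # group "b": mostly exact 1s
  ),
  var = c(
    rep('a', 100), rep('b', 100)
  )
) %>% 
  # keep only values inside the 0-1 response scale
  filter(
    val >= 0,
    val <= 1
  )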

d %>% 
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin()

# No.

d %>% 
  ggplot(aes(x = var, y = val)) +
  geom_boxplot()

# No-no.

Upvotes: 0

Answers (3)

m_ky

Reputation: 51

To my mind, interesting visualizations can also be accomplished with the nice {ggridges} package, for example with

d %>%
  ggplot() +
  aes(x = val, y = var) +
  ggridges::stat_density_ridges(quantile_lines = TRUE) +
  ggridges::theme_ridges()


Edit: yes, the smoothing on the tails is indeed unfortunate in that case. Maybe using geom_density_ridges(stat = "binline") instead of stat_density_ridges() would do the trick here? But again, not optimal, I guess...
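
A minimal sketch of the binline variant mentioned in the edit, using the question's example data. The seed, `bins = 20` and `scale = 0.9` are assumptions chosen for illustration, not values from the thread. Binned ridges avoid the kernel smoothing past 0 and 1, at the cost of a bin-width choice.

```r
library(tidyverse)
library(ggridges)

set.seed(1) # assumption: fixed seed so the sketch is repeatable

# Same structure as the example data in the question
d <- tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),
    rnorm(5, mean = 0.5, sd = 0.25),
    rep(1, 95)
  ),
  var = rep(c("a", "b"), each = 100)
) %>%
  filter(val >= 0, val <= 1)

# Histogram-style ridges instead of smoothed densities;
# bins and scale are illustrative choices
p <- ggplot(d, aes(x = val, y = var)) +
  geom_density_ridges(stat = "binline", bins = 20, scale = 0.9) +
  theme_ridges()

p
```

The spike of 1s in group "b" will still dominate its ridge, so the few remaining "b" responses may be hard to see at this scale.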

Upvotes: 2

petyar

Reputation: 535

Thanks y'all. I don't think there's a perfect answer to this, but I will go with multiple posters and, during write-up, discuss the oddities of the data and speculate on the process responsible in detail.

Upvotes: 1

iod

Reputation: 7592

One way to address this is to draw the violin plot (or box plot) without the 1s, and then add the points for the 1s separately to make clear that they're there. You'll need to explain this in a caption, though. The fact is the data is what it is, and if you have bad data, no amount of playing around with visualization is going to fix that.

d %>% 
  filter(!(val == 1 & var == "b")) %>% 
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin() +
  geom_jitter(data = filter(d, val == 1 & var == "b"), height = 0)

That said, the best option really depends on what the data represents, and why so many people made the mistake of choosing 1. Did they mistake it for a binary choice between 0 and 1 and prefer the 1? If so, it might be a good idea to randomly spread those 1s over the top half of the scale (by subtracting a random number between 0 and 0.5 from each) and graph it that way. But that's quite an assumption to make, and I would be very wary of manipulating data in such a way. It might be that your best option is to just give up on this as unusable data and move on (or collapse all the data into a binary variable, if that's usable).
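
A rough sketch of the two alternatives described above, using the question's example data. The column names `val_spread` and `prop_one` and the seed are made up for illustration. The spreading step alters the data and is only defensible if the binary-choice assumption actually holds.

```r
library(tidyverse)

set.seed(1) # assumption: fixed seed so the sketch is repeatable

# Same structure as the example data in the question
d <- tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),
    rnorm(5, mean = 0.5, sd = 0.25),
    rep(1, 95)
  ),
  var = rep(c("a", "b"), each = 100)
) %>%
  filter(val >= 0, val <= 1)

# Option 1: spread the exact 1s in group "b" over the top half of the scale
# by subtracting a uniform random number between 0 and 0.5 from each
d_spread <- d %>%
  mutate(
    val_spread = if_else(
      var == "b" & val == 1,
      val - runif(n(), min = 0, max = 0.5),
      val
    )
  )

p_spread <- ggplot(d_spread, aes(x = var, y = val_spread)) +
  geom_violin() +
  geom_jitter(width = 0.1, height = 0)

# Option 2: collapse to a binary outcome, the share of exact 1s per group
d_binary <- d %>%
  group_by(var) %>%
  summarise(prop_one = mean(val == 1), n = n())

p_spread
d_binary
```

Option 2 leaves the data untouched and may be the easier one to justify in a write-up.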

Upvotes: 1
