petyar

Reputation: 535

Visualising two very different distributions in one plot

This is more of a "how would you do it" than a "how to do it" question.

I have two groups, "a" and "b". "a" consists of responses that are more or less normally distributed and bounded between 0 and 1. "b", however, is mostly 1s.

The data come from a questionnaire. People in group "a" made mistakes; people in group "b" mostly figured it out.

How do I visualise these data side by side? Boxplots are messed up because the median of one of the groups is basically 1. Violin plots mess up the widths.

Here is a reproducible example with the rough idea.

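library(tidyverse)

d = tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),  # group "a": roughly normal
    rnorm(5, mean = 0.5, sd = 0.25),    # group "b": a few spread-out responses
    rep(1, 95)                          # group "b": the rest are exactly 1
  ),
  var = c(
    rep('a', 100), rep('b', 100)
  )
) %>%
  # keep only values inside the 0-1 response range
  filter(
    val >= 0,
    val <= 1
  )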

d %>% 
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin()

# No.

d %>% 
  ggplot(aes(x = var, y = val)) +
  geom_boxplot()

# No-no.

Upvotes: 0

Views: 200

Answers (3)

m_ky

Reputation: 51

To my mind, interesting visualizations can also be accomplished with the nice {ggridges} package, for example with

d %>%
  ggplot() +
  aes(x = val, y = var) +
  ggridges::stat_density_ridges(quantile_lines = TRUE) +
  ggridges::theme_ridges()


Edit: yes, the smoothing on the tails is indeed unfortunate in that case. Maybe using geom_density_ridges(stat = "binline") instead of stat_density_ridges() would do the trick here? But again, not optimal, I guess...
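For what it's worth, here is a minimal sketch of the binline idea from the edit above. It rebuilds the question's data with a fixed seed, and the `bins = 20` and `scale = 0.9` values are my own arbitrary choices, not something tuned for this data. The name `p_binline` is just for illustration.

```r
library(tidyverse)
library(ggridges)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Same structure as the question's data: "a" is roughly normal, "b" is mostly 1s
d <- tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),
    rnorm(5, mean = 0.5, sd = 0.25),
    rep(1, 95)
  ),
  var = rep(c("a", "b"), each = 100)
) %>%
  filter(val >= 0, val <= 1)

# Histogram-style ridges: bars instead of a kernel density, so there is
# no smoothed tail, although the last bar may still straddle 1
p_binline <- d %>%
  ggplot(aes(x = val, y = var)) +
  geom_density_ridges(stat = "binline", bins = 20, scale = 0.9) +
  theme_ridges()

p_binline
```

The spike at 1 in group "b" will still dominate its ridge, so this probably only trades one oddity for another.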

Upvotes: 2

petyar

Reputation: 535

Thanks y'all. I don't think there's a perfect answer to this, but I will go with multiple posters and, during write-up, discuss the oddities of the data and speculate on the process responsible in detail.

Upvotes: 1

iod

Reputation: 7592

One way to address this is to use a violin plot (or box plot) while ignoring the 1s, and then add the points for the 1s separately to make clear that they're there. You'll need to explain this in a caption, though. The fact is the data is what it is, and if you have bad data, no amount of playing around with visualization is going to fix that.

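# Violin and jittered points for everything except the exact 1s in group "b"
d %>%
  filter(!(val == 1 & var == "b")) %>%
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin() +
  # Add the 1s back as points, jittered horizontally only
  geom_jitter(data = filter(d, val == 1 & var == "b"), height = 0)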


That said, the best option really depends on what the data represents, and why so many people made the mistake of choosing 1. Did they mistake it for a binary choice between 0 and 1 and prefer the 1? If so, it might be a good idea to randomly spread those 1s over the top half of the scale (by subtracting a random number between 0 and 0.5 from each), and graph it that way. But that's quite an assumption to make, and I would be very wary of manipulating data in such a way. It might be that your best option is just to give up on this as unusable data and move on (or collapse all the data into the binary, if that's usable).
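If you did want to try the spreading idea, a rough sketch might look like the following. It rebuilds the question's data with a fixed seed. The column names `spread` and `val_spread` are made up for this example, and the uniform draw between 0 and 0.5 is exactly the strong assumption discussed above, not a recommendation.

```r
library(tidyverse)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Same structure as the question's data: "a" is roughly normal, "b" is mostly 1s
d <- tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),
    rnorm(5, mean = 0.5, sd = 0.25),
    rep(1, 95)
  ),
  var = rep(c("a", "b"), each = 100)
) %>%
  filter(val >= 0, val <= 1)

# Hypothetical transformation: move each exact 1 in group "b" down by a
# random amount between 0 and 0.5; every other value is left untouched
d_spread <- d %>%
  mutate(
    spread = val == 1 & var == "b",
    val_spread = if_else(spread, val - runif(n(), min = 0, max = 0.5), val)
  )

# Colour the altered points so the manipulation stays visible in the plot
p_spread <- d_spread %>%
  ggplot(aes(x = var, y = val_spread)) +
  geom_violin() +
  geom_jitter(aes(colour = spread), width = 0.1, height = 0)

p_spread
```

Keeping the original `val` column next to `val_spread`, and flagging the altered rows, at least makes it easy to show readers which points were moved.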

Upvotes: 1
