Reputation: 535
This is more of a "how would you do it" than a "how to do it" question.
I have two groups, "a" and "b". "a" consists of responses that are normally distributed, more or less, bounded between 0 and 1. "b", however, is mostly 1s.
The process that generated the data is a questionnaire. People in group "a" made mistakes, people in group "b" mostly figured it out.
How do I visualise these data side by side? Boxplots are messed up because the median of one of the groups is basically 1. Violin plots mess up the widths.
Here is a reproducible example with the rough idea.
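library(tidyverse)

set.seed(42)  # arbitrary seed, added so the example is reproducible

d = tibble(
  val = c(
    rnorm(100, mean = 0.5, sd = 0.25),  # group "a": roughly normal
    rnorm(5, mean = 0.5, sd = 0.25),    # group "b": a few spread-out values
    rep(1, 95)                          # group "b": mostly exact 1s
  ),
  var = c(
    rep('a', 100), rep('b', 100)
  )
) %>%
  # keep only values inside the 0-1 bounds
  filter(
    val >= 0,
    val <= 1
  )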
d %>%
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin()
# No.
d %>%
  ggplot(aes(x = var, y = val)) +
  geom_boxplot()
# No-no.
Upvotes: 0
Views: 200
Reputation: 51
To my mind, interesting visualizations can also be accomplished with the nice {ggridges} package, for example with
d %>%
  ggplot() +
  aes(x = val, y = var) +
  ggridges::stat_density_ridges(quantile_lines = TRUE) +
  ggridges::theme_ridges()
Edit: yes, the smoothing on the tails is indeed unfortunate in that case. Maybe using geom_density_ridges(stat = "binline") instead of stat_density_ridges() would do the trick here? But again, not optimal, I guess...
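For what it's worth, here is a rough sketch of that binline idea. It rebuilds data shaped like the question's example so it runs on its own; the seed, bins = 20, and scale = 0.9 are my own arbitrary picks, not something the data dictates.

```r
library(ggplot2)
library(ggridges)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Data shaped like the question's example (an assumption, rebuilt here):
# "a" is roughly normal on [0, 1], "b" is mostly exact 1s.
d <- data.frame(
  val = c(rnorm(100, mean = 0.5, sd = 0.25),
          rnorm(5, mean = 0.5, sd = 0.25),
          rep(1, 95)),
  var = rep(c("a", "b"), each = 100)
)
d <- d[d$val >= 0 & d$val <= 1, ]

# Histogram-style ridgelines: no kernel smoothing, so the pile-up at 1
# shows as a single tall bin instead of a smeared-out tail.
p_binline <- ggplot(d, aes(x = val, y = var)) +
  geom_density_ridges(stat = "binline", bins = 20, scale = 0.9,
                      draw_baseline = FALSE) +
  theme_ridges()

p_binline
```

The bar heights are scaled per ridge, so the two groups are still not directly comparable in height; that caveat would need to go in the caption.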
Upvotes: 2
Reputation: 535
Thanks y'all. I don't think there's a perfect answer to this, but I will go with multiple posters and, during write-up, discuss the oddities of the data and speculate on the process responsible in detail.
Upvotes: 1
Reputation: 7592
One way to address this is to use a violin plot (or box plot) while ignoring the 1's, and then just adding the points for the ones separately to clarify that they're there. You'll need to explain this in a caption, though. The fact is the data is what it is, and if you have bad data, no amount of playing around with visualization is going to fix that.
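# Draw the violin and points without the exact 1s in group "b",
# then add those 1s back as points with horizontal jitter only.
d %>%
  filter(!(val == 1 & var == "b")) %>%
  ggplot(aes(x = var, y = val)) +
  geom_jitter() +
  geom_violin() +
  geom_jitter(data = filter(d, val == 1 & var == "b"), height = 0)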
That said, the best option really depends on what the data represents, and why so many people made the mistake of choosing 1. Did they mistake it for a binary choice between 0 and 1 and prefer the 1? If so, it might be a good idea to randomly spread those ones over the top half of the scale (by subtracting a random number between 0 and 0.5 from each) and graph it that way. But that's quite an assumption to make, and I would be very wary of manipulating data in such a way. It might be that your best option is just to give up on this as unusable data and move on (or collapse all the data into a binary variable, if that's usable).
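To make those two fallbacks concrete, here is a minimal sketch. It rebuilds data shaped like the question's example so it runs on its own; the names d_spread, d_binary and prop_one are made up for illustration, and the spreading step bakes in the binary-choice assumption described above, so treat it as a what-if rather than a recommendation.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Data shaped like the question's example (an assumption, rebuilt here).
d <- data.frame(
  val = c(rnorm(100, mean = 0.5, sd = 0.25),
          rnorm(5, mean = 0.5, sd = 0.25),
          rep(1, 95)),
  var = rep(c("a", "b"), each = 100)
) %>%
  filter(val >= 0, val <= 1)

# Option 1 (strong assumption!): move each exact 1 in group "b" to a
# random spot in the top half by subtracting a uniform draw from (0, 0.5).
d_spread <- d %>%
  mutate(val = if_else(var == "b" & val == 1,
                       val - runif(n(), 0, 0.5),
                       val))

p_spread <- ggplot(d_spread, aes(x = var, y = val)) +
  geom_violin() +
  geom_jitter(width = 0.1, height = 0)

# Option 2: collapse to binary and compare the share of exact 1s per group.
d_binary <- d %>%
  group_by(var) %>%
  summarise(prop_one = mean(val == 1), n = n())

p_spread
d_binary
```

Whichever of these gets used, the caption should say so explicitly, since Option 1 changes the values being plotted.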
Upvotes: 1