Reputation:
I frequently make boxplots where some of the categories are quite small and others have plentiful data, superimposed with jittered raw datapoints. I'm looking for a reliable way to hide the box and whiskers for categories that are very small (N<5). The goal is that those little categories would show just the raw data using a geom_point() layer, but the categories where it makes sense would get the box-and-whisker treatment. The thing that seemed obvious to me, mapping alpha in the geom_boxplot() layer to a factor variable based on N, does not work because alpha only controls the fill and maybe the outliers in geom_boxplot, not the box and whiskers.
I have found a kludgey solution in the past that worked as long as I was willing to waste the color parameter on this problem. However, often I want to actually use color for something else, and mapping it twice leads to gnarly output. Another kludgey solution that occurs to me is using a data subset from which small categories have been deleted - the problem with this plan is that it won't correctly handle situations when these categories are subject to position_dodge() (as the dodge will "see" too few categories).
Minimal example below.
library(dplyr)    # filter()
library(ggplot2)
df <- data.frame(group=factor(sample(c("A","B"), size=110, replace=TRUE)),
                 sex=factor(c(rep("M",50), rep("F", 50), rep("NB", 10))),
                 height=c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))
# the "delete the small category" kludge: the dodge no longer sees NB in group A
dfsub <- filter(df, !(sex=="NB" & group=="A"))
ggplot(df, aes(x=group, y=height, colour=sex)) +
  geom_boxplot(data=dfsub) +
  geom_point(position=position_jitterdodge(jitter.width=0.2))
Upvotes: 2
Views: 1945
Reputation: 1580
I made a second column for your height data in which values from small-sample-size groups are replaced with NA. When plotting the data, use the original height column as the y aesthetic for points, and the new column with NA values for small groups as the y aesthetic for boxplots.
To make boxplots and points line up correctly, use geom_boxplot(position = position_dodge(preserve = "single")) to tell ggplot to keep a constant box width even when some groups have no data.
library(tidyverse)
df <- data.frame(group = factor(sample(c("A", "B"), size = 110, replace = TRUE)),
                 sex = factor(c(rep("M", 50), rep("F", 50), rep("NB", 10))),
                 height = c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))
n <- df %>% #calculate sample sizes
  group_by(group, sex) %>%
  summarize(n = n(), .groups = "drop")
df <- left_join(df, n, by = c("group", "sex")) %>% #join sample sizes to df
  #make second height column to use for boxplots: NA values if n is too small
  mutate(boxplot_height = ifelse(n < 5, NA, height))
ggplot(df, aes(x = group, colour = sex)) +
  #use height column that has groups with n < 5 coded as NA to plot boxplots
  geom_boxplot(aes(y = boxplot_height),
               #preserve = "single" maintains constant width of boxes
               position = position_dodge(preserve = "single")) +
  geom_point(aes(y = height), #use all height data as y variable for points
             position = position_jitterdodge(jitter.width = 0.2))
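Because group is sampled at random above, the NB cells will not always fall below the cutoff, so it can be hard to tell whether the masking worked. Below is a minimal sketch of the same NA-masking idea with a deterministic group assignment (an assumption made purely for illustration) and dplyr's add_count() in place of the separate summarize/join step. The names min_n, n_cell and df_counted are hypothetical, not from the answer above.

```r
library(dplyr)
library(ggplot2)

set.seed(1)   # assumption: fixed seed so the heights are reproducible
min_n <- 5    # hypothetical name for the cutoff used in the answer (n < 5)

# Assumption: groups are assigned deterministically so that NB/A has only 3 rows
df <- data.frame(
  group  = factor(c(rep(c("A", "B"), 50), rep("A", 3), rep("B", 7))),
  sex    = factor(c(rep("M", 50), rep("F", 50), rep("NB", 10))),
  height = c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6))
)

df_counted <- df %>%
  add_count(group, sex, name = "n_cell") %>%   # rows per group/sex cell
  # mask heights in small cells so the boxplot layer has nothing to draw there
  mutate(boxplot_height = if_else(n_cell < min_n, NA_real_, height))

p <- ggplot(df_counted, aes(x = group, colour = sex)) +
  geom_boxplot(aes(y = boxplot_height),
               position = position_dodge(preserve = "single"),
               na.rm = TRUE) +                 # drop the masked rows without a warning
  geom_point(aes(y = height),
             position = position_jitterdodge(jitter.width = 0.2))

# Inspect which cells were masked before plotting
count(df_counted, group, sex, masked = is.na(boxplot_height))
```

Printing p draws the plot. The final count() call is only a sanity check that the masked rows are exactly those in cells smaller than min_n.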
Upvotes: 1
Reputation: 3438
Okay, I don't think this way is necessarily any better than your current options, but... You could split your df into dfs for the boxplot and the scatterplot, and modify the values of the data you want removed from the boxplot to be way out of range (e.g., 1000 here). Then plot both, and finally use coord_cartesian to zoom in on the relevant section.
To create the df_box, we group by group and sex, and change the values of groups with < 5 datapoints to 1000 (so that we don't have to hard-code which values to change).
library(dplyr)
library(ggplot2)
df <- data.frame(group=factor(sample(c("A","B"), size=110, replace=TRUE)),
                 sex=factor(c(rep("M",50), rep("F", 50), rep("NB", 10))),
                 height=c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6)))
df_box <- df %>%
  group_by(group, sex) %>%
  # flag every row of a group/sex cell that has fewer than 5 datapoints
  mutate(temp = ifelse(n() < 5, 1000, 1)) %>%
  ungroup() %>%
  # push flagged rows far out of range so their box lands outside the zoomed view
  mutate(height = ifelse(temp == 1000, 1000, height)) %>%
  select(-temp)
ggplot(df, aes(x=group, y=height, colour=sex)) +
  geom_boxplot(data=df_box) +
  geom_point(position=position_jitterdodge(jitter.width=0.2)) +
  coord_cartesian(ylim=c(50,90))
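The hard-coded ylim=c(50,90) can clip real points if the data happen to range wider. One possible variation, sketched below, derives the zoom window from the real heights instead. The deterministic group assignment and the names sentinel and y_limits are assumptions for illustration only, and the sentinel has to stay well outside the real data range for this to work.

```r
library(dplyr)
library(ggplot2)

set.seed(1)        # assumption: fixed seed so the heights are reproducible
sentinel <- 1000   # hypothetical name for the out-of-range value

# Assumption: groups are assigned deterministically so that NB/A has only 3 rows
df <- data.frame(
  group  = factor(c(rep(c("A", "B"), 50), rep("A", 3), rep("B", 7))),
  sex    = factor(c(rep("M", 50), rep("F", 50), rep("NB", 10))),
  height = c(rnorm(50, 70, 6), rnorm(50, 63, 6), rnorm(10, 65, 6))
)

df_box <- df %>%
  group_by(group, sex) %>%
  # n() is constant within a cell, so a plain if/else is enough here
  mutate(height = if (n() < 5) sentinel else height) %>%
  ungroup()

# Zoom window taken from the real data, padded by 5% on each side
y_limits <- range(df$height) + c(-1, 1) * 0.05 * diff(range(df$height))

p <- ggplot(df, aes(x = group, y = height, colour = sex)) +
  geom_boxplot(data = df_box) +
  geom_point(position = position_jitterdodge(jitter.width = 0.2)) +
  coord_cartesian(ylim = y_limits)
```

Since coord_cartesian() only zooms and does not drop data, the boxes for small cells are still drawn at the sentinel value. They simply fall outside the visible window, while their slot in the dodge is kept.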
Upvotes: 1