Reputation: 11
I have a dataframe with a categorical variable and a few logical variables. I’ve created an example using the iris dataset:
# Create a new dataframe showing the percentage each variable that is considered long
# Numerical variables are first converted to logical based on if they are above or below a certain threshold, then the percentage of each variable that is true is calculated
long_data <- iris %>%
mutate(
Long.Sepal.Length = Sepal.Length > 6,
Long.Sepal.Width = Sepal.Width > 3,
Long.Petal.Length = Petal.Length > 2.5,
Long.Petal.Width = Petal.Width > 0.3
) %>%
group_by(Species) %>%
summarise(
Long.Sepal.Length = mean(Long.Sepal.Length) * 100,
Long.Sepal.Width = mean(Long.Sepal.Width) * 100,
Long.Petal.Length = mean(Long.Petal.Length) * 100,
Long.Petal.Width = mean(Long.Petal.Width) * 100
) %>%
pivot_longer(
cols = starts_with("Long"),
names_to = "Measurement",
values_to = "Percentage_True"
)
> long_data
# A tibble: 12 × 3
Species Measurement Percentage_True
<fct> <chr> <dbl>
1 setosa Long.Sepal.Length 0
2 setosa Long.Sepal.Width 84
3 setosa Long.Petal.Length 0
4 setosa Long.Petal.Width 18
5 versicolor Long.Sepal.Length 40
6 versicolor Long.Sepal.Width 16
7 versicolor Long.Petal.Length 100
8 versicolor Long.Petal.Width 100
9 virginica Long.Sepal.Length 82
10 virginica Long.Sepal.Width 34
11 virginica Long.Petal.Length 100
12 virginica Long.Petal.Width 100
I want to create a bar plot to compare the percentage of TRUE values for each logical variable by species. Here’s how I plan to create the plot:
# Create the bar plot
ggplot(long_data, aes(x = Measurement, y = Percentage_True, fill = Species)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
x = "Measurement",
y = "% True",
fill = "Species"
)
Plot I want to add significance bars to
How can I add significance bars above each group of bars?
I’ve tried using ggpubr functions, but I’m not sure how to make it work. I envison the plot to look something like the final graph on this page:https://rpkgs.datanovia.com/ggpubr/reference/geom_pwc.html But I cannot get it to work with my logical dataset, as the plot is not using the original data but the long format.
example of what I want my significance bars to look like
Upvotes: 1
Views: 41
Reputation: 19459
One (hacky) way is to use the ggsignif package, and I'm sure there is better way than what I've done.
I've done this in two stages. The first stage is to save the mutated columns before pivoting because you need this data to perform the statistical tests.
iris2 <- iris %>%
mutate(
Species=as.character(Species),
Long.Sepal.Length = Sepal.Length > 6,
Long.Sepal.Width = Sepal.Width > 3,
Long.Petal.Length = Petal.Length > 2.5,
Long.Petal.Width = Petal.Width > 0.3
)
Then the summarise and pivot (stage 2).
long_data <- iris2 %>%
group_by(Species) %>%
summarise(
Long.Sepal.Length = mean(Long.Sepal.Length) * 100,
Long.Sepal.Width = mean(Long.Sepal.Width) * 100,
Long.Petal.Length = mean(Long.Petal.Length) * 100,
Long.Petal.Width = mean(Long.Petal.Width) * 100
) %>%
pivot_longer(
cols = starts_with("Long"),
names_to = "Measurement",
values_to = "Percentage_True"
)
Now the tests. You are comparing proportions here. For example, the proportion of long sepals among the 3 species.
xtabs(~Species + Long.Sepal.Width, data=iris2)
Long.Sepal.Width
Species FALSE TRUE
setosa 8 42
versicolor 42 8
virginica 33 17
I believe the correct pairwise test to use here is a proportion test. As far as I'm aware, this test (prop.test
) is not available in any of the ggplot-type packages, including ggsignif - I'd be happy if someone could prove me wrong on that. Instead of prop.test
(or chisq.test
) from the stats package, I've used the pairwise_prop_test
function from the rstatix
package for convenience. I've only tested the sepal lengths here for simplicity.
p.sw <- pairwise_prop_test(xtabs(~Species + Long.Sepal.Width,
data=iris2))
# A tibble: 3 × 5
group1 group2 p p.adj p.adj.signif
* <chr> <chr> <dbl> <dbl> <chr>
1 setosa versicolor 4.11e-11 1.23e-10 ****
2 setosa virginica 1.06e- 6 2.12e- 6 ****
3 versicolor virginica 6.47e- 2 6.47e- 2 ns
Now manually add the adjusted p-values to the plot. The hacky part is getting the x-positions (xmin, xmax) in the right places.
p <- ggplot(long_data, aes(x = Measurement, y = Percentage_True, fill = Species)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
x = "Measurement",
y = "% True",
fill = "Species"
) +
scale_y_continuous(limits=c(0,120)) +
theme_minimal()
p + geom_signif(
textsize = 3,
tip_length=0.02,
annotations = p.sw$p.adj,
y_position = 105 + c(0, 7, 14),
xmin = 3.7 + c(0, 0, 0.3),
xmax = 4.0 + c(0, 0.3, 0.3)
)
Upvotes: 0