Reputation: 347
I am working on a plot using ggstatsplot
R
package. After completing the plot I have an issue. This is the code for my plot:
library(ggstatsplot)
#Data
data("movies_long")
movies_long
#Plot
Myplot <- ggbarstats(
data = movies_long,
x = mpaa,
y = year,
title = "Distribution of movies per year",
xlab = "year",
legend.title = "",
ggplot.component = list(ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 2))),
palette = "Set2"
)
The output is next:
The plot is very useful but it has two problems. First is the overlap of p-values at top of bars. Second is the size n which also has overlap, how could I improve the p-values labels at the top of bars and also the size n at bottom?
Is there any way to take these p-values and translate in p<0.001, p<0.1 using confidence levels?
Many thanks for your help. In the extreme case, is there any way to know how to compute the p-values (which test is being applied) at each bar, so that I can replicate it and create a similar plot using ggplot2
?
Upvotes: 1
Views: 851
Reputation: 3690
I challenge your statement that this plot is very useful. Apart from the overlapping labels, there are statistical issues including:
Here is the same plot restricted to movies released between 1991 and 2005. It's better, though still peppered with overabundance of uncorrected p-values.
Note: I've made the plot large enough to accommodate all those labels. Now the text is a bit small but with some trial and error you can get the plot height & width "right".
As for the p-values on top of each bar... There is a handy way to extract the data visualized in a ggplot:
p$data
#> # A tibble: 41 × 5
#> year mpaa counts perc .label
#> <fct> <fct> <int> <dbl> <chr>
#> 1 1991 R 5 83.3 83%
#> 2 1992 R 6 100 100%
#> 3 1993 R 5 83.3 83%
#> 4 1994 R 20 80 80%
#> 5 1995 R 59 69.4 69%
#> 6 1996 R 63 60.6 61%
#> 7 1997 R 71 62.3 62%
#> 8 1998 R 82 61.7 62%
#> 9 1999 R 98 61.3 61%
#> 10 2000 R 77 48.4 48%
#> # … with 31 more rows
Now that we've confirmed that each bar is a simple summary of the number of movies by audience rating (R, PG, PG-13), we can quickly check that the p-value comes from a chi-squared test on the counts by rating:
p$data %>%
filter(
year == 2000
)
#> # A tibble: 3 × 5
#> year mpaa counts perc .label
#> <fct> <fct> <int> <dbl> <chr>
#> 1 2000 R 77 48.4 48%
#> 2 2000 PG-13 64 40.3 40%
#> 3 2000 PG 18 11.3 11%
chisq.test(c(77, 64, 18))
#>
#> Chi-squared test for given probabilities
#>
#> data: c(77, 64, 18)
#> X-squared = 36.264, df = 2, p-value = 1.335e-08
Upvotes: 3