Reputation: 33
My data looks like this:
ID Date Reduction Collected Provided Freq Gender
1 AAA016000 2018-04-10 0 0 7 1 <NA>
2 AAA059717 2017-03-21 1 0 45 10 Female
3 AAA059717 2017-04-22 0 0 10 10 Female
4 AAA059717 2017-05-09 0 0 10 2 Female
5 AAA059717 2017-06-09 1 0 40 6 Female
6 AAA059717 2018-07-03 NA 180 200 35 Female
7 AAA059717 2018-09-26 NA 10 30 15 Female
8 AAA059717 2018-09-26 1 NA NA NA Female
9 AAA059717 2018-10-12 NA 0 20 3 Female
10 AAA059717 2018-11-07 NA 30 50 20 Female
11 AAA059717 2018-11-07 0 NA NA NA Female
12 AAA059717 2018-11-08 NA 2 20 10 Female
'data.frame': 190122 obs. of 7 variables:
$ ID : chr "AAA016000" "AAA059717" "AAA059717" "AAA059717" ...
$ Date : Date, format: "2018-04-10" "2017-03-21" "2017-04-22" "2017-05-09" ...
$ Reduction : num 0 1 0 0 1 NA NA 1 NA NA ...
$ Collected : num 0 0 0 0 0 180 10 NA 0 30 ...
$ Provided : num 7 45 10 10 40 200 30 NA 20 50 ...
$ Freq : num 1 10 10 2 6 35 15 NA 3 20 ...
$ Gender : chr NA "Female" "Female" "Female" ...
To find out whether a higher Freq also goes with a higher Provided, I tried this:
ggplot(data = df, aes(x = Freq, y = Provided)) +
  geom_point() +
  geom_line()
But the graph doesn't look right.
How do I make a better graph to visualize whether a higher Freq goes with a higher Provided than a lower Freq? And lastly, how do I visualize whether a Freq of 10 or over goes with more Provided than a Freq under 10? Thank you for your response, I appreciate it.
Upvotes: 1
Views: 64
Reputation: 10637
There is a strong, significant linear correlation between Freq and Provided (Pearson, effect size R = 0.89, p < 0.001).
Rows with a Freq of 10 or more do not have significantly higher Provided values (Wilcoxon rank sum test, p = 0.16). Keep in mind that splitting the Freq variable into two categories (high and low) is often arbitrary, and significance can depend heavily on the threshold (here 10).
library(tidyverse)
library(ggpubr)
df <- tribble(
~row_id, ~ID, ~Date, ~Reduction, ~Collected, ~Provided, ~Freq, ~Gender,
1, "AAA016000", "2018-04-10", 0, 0, 7, 1, NA,
2, "AAA059717", "2017-03-21", 1, 0, 45, 10, "Female",
3, "AAA059717", "2017-04-22", 0, 0, 10, 10, "Female",
4, "AAA059717", "2017-05-09", 0, 0, 10, 2, "Female",
5, "AAA059717", "2017-06-09", 1, 0, 40, 6, "Female",
6, "AAA059717", "2018-07-03", NA, 180, 200, 35, "Female",
7, "AAA059717", "2018-09-26", NA, 10, 30, 15, "Female",
8, "AAA059717", "2018-09-26", 1, NA, NA, NA, "Female",
9, "AAA059717", "2018-10-12", NA, 0, 20, 3, "Female",
10, "AAA059717", "2018-11-07", NA, 30, 50, 20, "Female",
11, "AAA059717", "2018-11-07", 0, NA, NA, NA, "Female",
12, "AAA059717", "2018-11-08", NA, 2, 20, 10, "Female"
)
df %>%
  ggplot(aes(Freq, Provided)) +
  geom_point() +
  stat_smooth(method = "lm") +
  stat_cor(method = "pearson")
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 2 rows containing non-finite values (stat_smooth).
#> Warning: Removed 2 rows containing non-finite values (stat_cor).
#> Warning: Removed 2 rows containing missing values (geom_point).
df %>%
  mutate(high_Freq = Freq >= 10) %>%
  filter(!is.na(high_Freq)) %>%
  ggplot(aes(high_Freq, Provided)) +
  geom_boxplot() +
  stat_compare_means()
Created on 2021-11-10 by the reprex package (v2.0.1)
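To see how much the high/low comparison depends on the cutoff, here is a minimal base R sketch that reruns the Wilcoxon rank sum test for a few thresholds. It uses only the ten complete (Freq, Provided) pairs from the sample rows above. The helper name p_at_threshold and the chosen thresholds are assumptions for illustration, not part of the original analysis.

```r
# Complete (Freq, Provided) pairs from the sample data (rows with NA dropped)
freq     <- c(1, 10, 10, 2, 6, 35, 15, 3, 20, 10)
provided <- c(7, 45, 10, 10, 40, 200, 30, 20, 50, 20)

# Hypothetical helper: two-sided Wilcoxon rank sum p-value for one cutoff
p_at_threshold <- function(threshold) {
  high <- freq >= threshold
  # exact = FALSE uses the normal approximation, since ties in Provided
  # rule out an exact p-value anyway
  wilcox.test(provided[high], provided[!high], exact = FALSE)$p.value
}

# Illustrative cutoffs; each one leaves at least two rows in both groups
thresholds <- c(3, 6, 10, 15)
p_values <- sapply(thresholds, p_at_threshold)

data.frame(threshold = thresholds, p_value = round(p_values, 3))
```

The row for threshold 10 should reproduce the p = 0.16 reported above. With so few rows none of this is conclusive, so on the full data it is probably safer to rely on the scatter plot and the correlation than on any single cutoff.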
Upvotes: 1