Reputation: 21
I'm trying to find an elegant way to perform t-tests comparing the means of 6 groups of data, preferably using dplyr/tidyverse. My data looks something along the lines of:
Grouping Variable   Numerical variable
A                   5.6
A                   2.3
A                   4.8
B                   7.3
B                   6.9
B                   5.8
C                   1.4
C                   6.4
I know I can do something like:
df_a <- df %>% filter(grouping_variable == 'A')
df_b <- df %>% filter(grouping_variable == 'B')
a_b <- t.test(df_a$numerical_variable, df_b$numerical_variable)$p.value
And then repeat that for every pair of groups. There are only 6 groups, so the above isn't out of the question, but there has to be a simpler way along the lines of:
df %>% group_by(grouping_variable)%>%
t.test(of each on each)
Maybe something with tidy?
My end result is to get a tibble along the lines of
A B C D E F
A .34 .4 .235 ...
B .03 .34 .454...
Upvotes: 2
Views: 402
Reputation: 2218
You are looking for pairwise.t.test. It lets you specify a p-value adjustment method as well as the alternative hypothesis. Refer to the R documentation (?pairwise.t.test) for details.
Usage:
pairwise.t.test(x, g, p.adjust.method = p.adjust.methods,
pool.sd = !paired, paired = FALSE,
alternative = c("two.sided", "less", "greater"),
...)
For your case, you can do something like:
pairwise.ttest <- pairwise.t.test(x = df$num_var, g = df$group_var)
pairwise.ttest$p.value
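Two defaults are worth knowing here: pairwise.t.test pools the standard deviation across all groups (pool.sd = TRUE for unpaired data) and applies the Holm adjustment, so its p-values will generally differ from those of separate t.test calls. A minimal sketch, on made-up data with the same illustrative column names as above, of switching both off so that each entry matches a plain two-sample t.test for that pair:

```r
# Made-up example data; group_var and num_var are illustrative column names
set.seed(42)
df <- data.frame(
  group_var = rep(c("A", "B", "C"), each = 5),
  num_var   = rnorm(15)
)

# No pooled SD and no multiplicity adjustment: each cell is a plain Welch t-test
pw <- pairwise.t.test(
  x = df$num_var,
  g = df$group_var,
  p.adjust.method = "none",
  pool.sd = FALSE
)
pw$p.value

# Compare one cell against a direct t.test call for the same pair
single <- t.test(
  df$num_var[df$group_var == "B"],
  df$num_var[df$group_var == "A"]
)$p.value
all.equal(pw$p.value["B", "A"], single)
```

The result is a lower-triangular matrix (NA above the diagonal), so each pair appears only once. Whether to keep the adjustment depends on how the p-values will be used.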
Upvotes: 0
Reputation: 376
This can be done cleanly using the cross and map functions from purrr.
Sample data:
library(tidyverse)
df <- tibble(group_var = rep(c("A", "B", "C"), times = 5),
             num_var = rnorm(15))
df
# A tibble: 15 x 2
group_var num_var
<chr> <dbl>
1 A 1.66
2 B -0.694
3 C -0.680
4 A 1.96
5 B -0.380
6 C -0.941
7 A 1.02
8 B 0.0476
9 C 0.770
10 A 1.41
11 B 0.137
12 C -0.816
13 A -0.478
14 B 0.374
15 C -0.619
Use cross_df to create a data frame with all the group combinations:
test_results <- cross_df(list(var1 = c("A", "B", "C"), var2 = c("A", "B", "C")))
Add a column with the t-test p-values, then spread it into a wide table:
test_results <- test_results %>%
  mutate(ttest = map2_dbl(var1, var2,
                          ~ t.test(df %>% filter(group_var == .x) %>% .$num_var,
                                   df %>% filter(group_var == .y) %>% .$num_var)$p.value))
test_results %>%
  spread(var2, ttest)
var1 A B C
<chr> <dbl> <dbl> <dbl>
1 A 1 0.0436 0.0197
2 B 0.0436 1 0.367
3 C 0.0197 0.367 1
This is a bit easier to read if you wrap t.test in a function:
ttester <- function(v1, v2) {
  t <- t.test(df %>% filter(group_var == v1) %>% .$num_var,
              df %>% filter(group_var == v2) %>% .$num_var)
  t$p.value
}
cross_df(list(var1 = c("A", "B", "C"), var2 = c("A", "B", "C"))) %>%
  mutate(ttest = map2_dbl(var1, var2, ~ttester(.x, .y))) %>%
  spread(var2, ttest)
var1 A B C
<chr> <dbl> <dbl> <dbl>
1 A 1 0.0436 0.0197
2 B 0.0436 1 0.367
3 C 0.0197 0.367 1
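In recent releases, cross_df is deprecated (purrr 1.0.0 points to tidyr::expand_grid instead) and spread is superseded by pivot_wider. A sketch of the same idea with those functions, on made-up data; p_for_pair and the seed are illustrative choices, not part of the code above:

```r
library(dplyr)
library(purrr)
library(tidyr)

# Made-up data with the same shape as the sample above
set.seed(1)
df <- tibble(group_var = rep(c("A", "B", "C"), times = 5),
             num_var = rnorm(15))

# Hypothetical helper: Welch t-test p-value for one pair of groups
p_for_pair <- function(g1, g2) {
  t.test(df$num_var[df$group_var == g1],
         df$num_var[df$group_var == g2])$p.value
}

groups <- sort(unique(df$group_var))

# All ordered pairs, one p-value each, then reshaped into a wide table
p_matrix <- expand_grid(var1 = groups, var2 = groups) %>%
  mutate(p_value = map2_dbl(var1, var2, p_for_pair)) %>%
  pivot_wider(names_from = var2, values_from = p_value)

p_matrix
```

Taking the group labels from the data rather than typing them out means the same code works unchanged for six groups.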
Upvotes: 1
Reputation: 9570
First, some data:
library(tidyverse)
library(magrittr)  # for the %$% operator used below

df <-
  tibble(
    Group = rep(LETTERS[1:8], each = 10)
    , Value = rnorm(80)
  )
From this, I am pulling the unique group levels:
my_groups <-
sort(unique(df$Group))
Then, I like using lapply to loop through the groups. Basically, for every pair of groups, I am running a t-test and recording the metrics of interest (group means, difference, p-value) as a tibble, then binding the rows together. Note that I am using the %$% operator from magrittr as a bit of a shortcut to get the metrics out of the t.test result.
t_tests_out <-
  lapply(my_groups, function(group_a){
    lapply(my_groups, function(group_b){
      # Skip case where a and b are the same
      if(group_a == group_b){
        return(NULL)
      }
      # Relabel the pair as "A"/"B" so estimate[1] is group_a and estimate[2] is group_b
      df %>%
        filter(Group %in% c(group_a, group_b)) %>%
        mutate(temp_group = ifelse(Group == group_a, "A", "B")) %>%
        t.test(Value ~ temp_group, data = .) %$%
        tibble(
          group_a = group_a
          , group_b = group_b
          , mean_a = estimate[1]
          , mean_b = estimate[2]
          , diff = mean_a - mean_b
          , pval = p.value
        )
    }) %>%
      bind_rows()
  }) %>%
  bind_rows()
The result looks like this:
# A tibble: 56 x 6
group_a group_b mean_a mean_b diff pval
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A B -0.275 0.0851 -0.360 0.384
2 A C -0.275 -0.651 0.376 0.406
3 A D -0.275 -0.440 0.165 0.737
4 A E -0.275 0.336 -0.611 0.245
5 A F -0.275 -0.277 0.00233 0.996
6 A G -0.275 -0.115 -0.160 0.754
7 A H -0.275 -0.406 0.131 0.821
8 B A 0.0851 -0.275 0.360 0.384
9 B C 0.0851 -0.651 0.736 0.0748
10 B D 0.0851 -0.440 0.525 0.245
# ... with 46 more rows
The long format can be really useful for some things, like plotting the results:
t_tests_out %>%
  ggplot(aes(x = group_a
             , y = group_b
             , fill = pval)) +
  geom_tile(col = "white") +
  scale_fill_distiller(palette = "YlOrRd"
                       , limits = c(0,1)) +
  theme_minimal()
You can also spread the results to make the table you were looking for:
t_tests_out %>%
select(group_a, group_b, pval) %>%
spread(group_b, pval)
returns
# A tibble: 8 x 9
group_a A B C D E F G H
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA 0.384 0.406 0.737 0.245 0.996 0.754 0.821
2 B 0.384 NA 0.0748 0.245 0.595 0.439 0.668 0.371
3 C 0.406 0.0748 NA 0.659 0.0632 0.456 0.291 0.668
4 D 0.737 0.245 0.659 NA 0.163 0.762 0.547 0.955
5 E 0.245 0.595 0.0632 0.163 NA 0.280 0.425 0.243
6 F 0.996 0.439 0.456 0.762 0.280 NA 0.770 0.835
7 G 0.754 0.668 0.291 0.547 0.425 0.770 NA 0.640
8 H 0.821 0.371 0.668 0.955 0.243 0.835 0.640 NA
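The nested lapply tests every pair in both orders (A vs B and B vs A), which is why the long table has 56 rows and the wide table is symmetric. If one row per unordered pair is enough, a sketch using base R's combn is shown here; it is an alternative under that assumption, on made-up data, and t_tests_once is an illustrative name:

```r
# Made-up data with the same shape as above
set.seed(1)
df <- data.frame(
  Group = rep(LETTERS[1:8], each = 10),
  Value = rnorm(80)
)
my_groups <- sort(unique(df$Group))

# Every unordered pair of groups, as a list of length-2 character vectors
pairs <- combn(my_groups, 2, simplify = FALSE)

# One Welch t-test per pair, collected into a single data frame
t_tests_once <- do.call(rbind, lapply(pairs, function(pair) {
  a <- df$Value[df$Group == pair[1]]
  b <- df$Value[df$Group == pair[2]]
  res <- t.test(a, b)
  data.frame(
    group_a = pair[1],
    group_b = pair[2],
    mean_a  = mean(a),
    mean_b  = mean(b),
    diff    = mean(a) - mean(b),
    pval    = res$p.value
  )
}))

nrow(t_tests_once)  # choose(8, 2) = 28 rows
```

This halves the number of tests, at the cost of a table that no longer spreads into a full square matrix without mirroring the rows first.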
Upvotes: 0
Reputation: 2399
Check this solution:
library(tidyverse)
library(magrittr)
df %$%
  crossing(
    gr1 = grouping_variable %>% unique(),
    gr2 = grouping_variable %>% unique()
  ) %>%
  filter(gr1 != gr2) %>%
  left_join(
    df %>%
      group_by(grouping_variable) %>%
      nest() %>%
      rename_all(~c('gr1', 'data1'))
  ) %>%
  left_join(
    df %>%
      group_by(grouping_variable) %>%
      nest() %>%
      rename_all(~c('gr2', 'data2'))
  ) %>%
  mutate(p_val = map2_dbl(
    data1, data2,
    ~t.test(
      .x$numerical_variable,
      .y$numerical_variable
    )$p.value
  )
  )
Upvotes: 0