Reputation: 21

Iterating t-test with dplyr

I'm trying to find an elegant way to preform a t-test comparing the means of 6 groups of data, preferably using dplyr/tidyverse. My data looks something along the lines of:

Grouping Variable Numerical variable

A 5.6

A 2.3

A 4.8

B 7.3

B 6.9

B 5.8

C 1.4

C 6.4

I know I can do something like:

df_a <- df %>% filter(grouping_variable == 'A')
df_b <- df %>% filter(grouping_variable == 'B')
a_b <- t.test(df_a,df_b)$p.value

And then repeat that for every variable combo. There are only 6 grouping variables, so the above isn't out of the question, but there has to be a simpler way along the lines of:

df %>% group_by(grouping_variable)%>%
t.test(of each on each)

Maybe something with tidy?

My end result is to get a tibble along the lines of

A B C D E F

A .34 .4 .235 ...

B .03 .34 .454...

Upvotes: 2

Answers (4)

Mankind_2000

Reputation: 2218

You are looking for pairwise.t.test. It allows you to mention a p-value adjustment method as well as the alternative hypothesis. Refer R documentation for details.

Usage:

pairwise.t.test(x, g, p.adjust.method = p.adjust.methods,
            pool.sd = !paired, paired = FALSE,
            alternative = c("two.sided", "less", "greater"),
            ...)

For your case, you can do something like:

pairwise.ttest <- pairwise.t.test(x = df$num_var, g = df$group_var)
pairwise.ttest$p.value

Upvotes: 0

benc

Reputation: 376

This can be done cleanly using the cross and map functions from purrr.

Sample data:

df <- tibble(group_var = rep(c("A", "B", "C"), times = 5), 
         num_var = rnorm(15))
df
# A tibble: 15 x 2
   group_var num_var
   <chr>       <dbl>
 1 A          1.66  
 2 B         -0.694 
 3 C         -0.680 
 4 A          1.96  
 5 B         -0.380 
 6 C         -0.941 
 7 A          1.02  
 8 B          0.0476
 9 C          0.770 
10 A          1.41  
11 B          0.137 
12 C         -0.816 
13 A         -0.478 
14 B          0.374 
15 C         -0.619

Use cross to create a dataframe with all the variable combinations:

test_results <- cross_df(list(var1 = c("A", "B", "C"), var2 = c("A", "B", "C")))

Add column with ttest results:

test_results <- test_results %>% 
  mutate(ttest = map2_dbl(var1, var2, 
                          ~ t.test(df %>% filter(group_var == .x) %>% .$num_var,
                                   df %>% filter(group_var == .y) %>% .$num_var)$p.value))

 test_results %>% 
  spread(var2, ttest)
  var1       A      B      C
  <chr>  <dbl>  <dbl>  <dbl>
1 A     1      0.0436 0.0197
2 B     0.0436 1      0.367 
3 C     0.0197 0.367  1

This is a bit easier to read if you wrap t.test in a function:

ttester <- function(v1, v2) {
  t <- t.test(df %>% filter(group_var == v1) %>% .$num_var,
              df %>% filter(group_var == v2) %>% .$num_var)
  t$p.value
}

cross_df(list(var1 = c("A", "B", "C"), var2 = c("A", "B", "C"))) %>% 
  mutate(ttest = map2_dbl(var1, var2, ~ttester(.x, .y))) %>% 
  spread(var2, ttest)
  var1       A      B      C
  <chr>  <dbl>  <dbl>  <dbl>
1 A     1      0.0436 0.0197
2 B     0.0436 1      0.367 
3 C     0.0197 0.367  1

Upvotes: 1

Mark Peterson

Reputation: 9570

First, some data:

df <-
  data_frame(
    Group = rep(LETTERS[1:8], each = 10)
    , Value = rnorm(80)
  )

From this, I am pulling the unique group levels:

my_groups <-
  sort(unique(df$Group))

Then, I like using lapply to loop through the metrics of interest. Basically, for every pair of groups, I am running a t-test and recording the metrics of interest (group means, difference, p-value) as a data_frame then binding the rows together. Note that I am using the %$% operator from magrittr as a bit of a shortcut to get the metrics out of the t.test result.

t_tests_out <-
  lapply(my_groups, function(group_a){
    lapply(my_groups, function(group_b){

      # Skip case where a and b are the same
      if(group_a == group_b){
        return(NULL)
      }

      df %>%
        filter(Group %in% c(group_a, group_b)) %>%
        mutate(temp_group = ifelse(Group == group_a, "A", "B")) %>%
        t.test(Value ~ temp_group, data = .) %$%
        data_frame(
          group_a = group_a
          , group_b = group_b
          , mean_a = estimate[1]
          , mean_b = estimate[2]
          , diff = mean_a - mean_b
          , pval = p.value
        )

    }) %>%
      bind_rows()
  }) %>%
  bind_rows()

This looks like this:

# A tibble: 56 x 6
   group_a group_b  mean_a  mean_b     diff   pval
   <chr>   <chr>     <dbl>   <dbl>    <dbl>  <dbl>
 1 A       B       -0.275   0.0851 -0.360   0.384 
 2 A       C       -0.275  -0.651   0.376   0.406 
 3 A       D       -0.275  -0.440   0.165   0.737 
 4 A       E       -0.275   0.336  -0.611   0.245 
 5 A       F       -0.275  -0.277   0.00233 0.996 
 6 A       G       -0.275  -0.115  -0.160   0.754 
 7 A       H       -0.275  -0.406   0.131   0.821 
 8 B       A        0.0851 -0.275   0.360   0.384 
 9 B       C        0.0851 -0.651   0.736   0.0748
10 B       D        0.0851 -0.440   0.525   0.245 
# ... with 46 more rows

While the long format can be really useful for somethings, like plotting the results:

t_tests_out %>%
  ggplot(aes(x = group_a
             , y = group_b
             , fill = pval)) +
  geom_tile(col = "white") +
  scale_fill_distiller(palette = "YlOrRd"
                       , limits = c(0,1)) +
  theme_minimal()

You can also spread the results to make the table you were looking for:

t_tests_out %>%
  select(group_a, group_b, pval) %>%
  spread(group_b, pval)

returns

# A tibble: 8 x 9
  group_a      A       B       C      D       E      F      G      H
  <chr>    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
1 A       NA      0.384   0.406   0.737  0.245   0.996  0.754  0.821
2 B        0.384 NA       0.0748  0.245  0.595   0.439  0.668  0.371
3 C        0.406  0.0748 NA       0.659  0.0632  0.456  0.291  0.668
4 D        0.737  0.245   0.659  NA      0.163   0.762  0.547  0.955
5 E        0.245  0.595   0.0632  0.163 NA       0.280  0.425  0.243
6 F        0.996  0.439   0.456   0.762  0.280  NA      0.770  0.835
7 G        0.754  0.668   0.291   0.547  0.425   0.770 NA      0.640
8 H        0.821  0.371   0.668   0.955  0.243   0.835  0.640 NA

Upvotes: 0

Paweł Chabros

Reputation: 2399

Check this solution:

library(tidyverse)
library(magrittr)

df %$% 
crossing(
  gr1 = grouping_variable %>% unique(),
  gr2 = grouping_variable %>% unique()
) %>%
  filter(gr1 != gr2) %>%
  left_join(
    df %>%
      group_by(grouping_variable) %>%
      nest() %>%
      rename_all(~c('gr1', 'data1'))
  ) %>%
  left_join(
    df %>%
      group_by(grouping_variable) %>%
      nest() %>%
      rename_all(~c('gr2', 'data2'))
  ) %>%
  mutate(p_val = map2_dbl(
      data1, data2,
      ~t.test(
        .x$numerical_variable,
        .y$numerical_variable
      )$p.value
    )
  )

Upvotes: 0

Iterating t-test with dplyr

Answers (4)

Related Questions