civy

Reputation: 423

Perform multiple two-sample t-test using dplyr in R

I would like to perform multiple pairwise t-tests on a dataset containing about 400 different column variables and 3 subject groups, and extract the p-value for every comparison. A shorter representative example of the data, using only 2 variables, could be the following:

df <- tibble(var1 = rnorm(90, 1, 1), var2 = rnorm(90, 1.5, 1), group = rep(1:3, each = 30))

Ideally the end result would be a summarised data frame containing four columns: one for the variable being tested (var1, var2, etc.), two for the pair of groups being compared, and a final one for the p-value.

I've tried duplicating the group column in the long form and doing a double group_by in order to do the comparisons, but with no success:

result <- df %>%
  pivot_longer(var1:var2, names_to = "var", values_to = "value") %>%
  rename(group_a = group) %>%
  mutate(group_b = group_a) %>%
  group_by(group_a, group_b) %>%
  summarise(n = n())

Upvotes: 0

Views: 1356

Answers (2)

the-mad-statter

Reputation: 8676

In case you end up wanting more detail about the t-tests, here is an approach that also extracts the degrees of freedom and the value of the test statistic:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

df <- tibble(
  var1 = rnorm(90, 1, 1), 
  var2 = rnorm(90, 1.5, 1), 
  group = rep(1:3, each = 30)
)

df %>% 
  select(-group) %>% 
  names() %>% 
  map_dfr(~ {
    y <- .
    
    combn(3, 2) %>% 
      t() %>% 
      as.data.frame() %>% 
      pmap_dfr(function(V1, V2) {
        df %>% 
          select(group, all_of(y)) %>% 
          filter(group %in% c(V1, V2)) %>% 
          t.test(as.formula(sprintf("%s ~ group", y)), ., var.equal = TRUE) %>% 
          tidy() %>% 
          transmute(y = y, 
                    group_1 = V1, 
                    group_2 = V2, 
                    df = parameter, 
                    t_value = statistic, 
                    p_value = p.value
          )
      })
  })
#> # A tibble: 6 x 6
#>   y     group_1 group_2    df t_value p_value
#>   <chr>   <int>   <int> <dbl>   <dbl>   <dbl>
#> 1 var1        1       2    58  -0.337  0.737 
#> 2 var1        1       3    58  -1.35   0.183 
#> 3 var1        2       3    58  -1.06   0.295 
#> 4 var2        1       2    58  -0.152  0.879 
#> 5 var2        1       3    58   1.72   0.0908
#> 6 var2        2       3    58   1.67   0.100
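
To make the inner loop easier to follow, here is a minimal sketch of what the combn() step hands to pmap_dfr. The name group_labels is only an illustrative stand-in (it is not in the code above); with real data you would probably build it from something like sort(unique(df$group)) instead of hard-coding 3.

```r
# Hypothetical stand-in for the group labels found in the data.
group_labels <- c(1, 2, 3)

# combn() returns one column per unordered pair; t() turns each pair into a row.
pairs <- as.data.frame(t(combn(group_labels, 2)))

# The columns are auto-named V1 and V2, which is why the function passed to
# pmap_dfr takes arguments called V1 and V2.
pairs
#>   V1 V2
#> 1  1  2
#> 2  1  3
#> 3  2  3
```

Each row becomes one call to t.test, so 3 groups give 3 comparisons per variable.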

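And here is @akrun's answer tweaked to give essentially the same p-values as the above approach. Note that p.adjust.method = "none" leaves the p-values uncorrected for multiple comparisons, which will inflate your Type I error rate. Also note that pool.sd = FALSE makes pairwise.t.test run a separate t-test for each pair using t.test's default unequal-variance (Welch) setting rather than var.equal = TRUE as above, which is why a p-value can differ in the last digit (0.0908 vs 0.0909).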

df %>% 
  pivot_longer(
    cols = -group, 
    names_to = "y"
  ) %>% 
  group_by(y) %>%
  summarise(
    out = list(
      tidy(
        pairwise.t.test(
          value, 
          group, 
          p.adjust.method = "none",
          pool.sd = FALSE
        )
      )
    )
  ) %>% 
  unnest(c(out))
#> # A tibble: 6 x 4
#>   y     group1 group2 p.value
#>   <chr> <chr>  <chr>    <dbl>
#> 1 var1  2      1       0.737 
#> 2 var1  3      1       0.183 
#> 3 var1  3      2       0.295 
#> 4 var2  2      1       0.879 
#> 5 var2  3      1       0.0909
#> 6 var2  3      2       0.100

Created on 2021-07-30 by the reprex package (v1.0.0)
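
If the inflated Type I error rate is a concern, one option is to adjust the p-values afterwards with p.adjust. The following is only a sketch: the results tibble re-types the unadjusted p-values from the first output above for illustration, and both the choice of Holm's method and adjusting within each variable (rather than across all comparisons at once) are assumptions you may want to change.

```r
library(dplyr)

# Unadjusted p-values copied from the first output above (illustrative only).
results <- tibble(
  y       = rep(c("var1", "var2"), each = 3),
  group_1 = c(1, 1, 2, 1, 1, 2),
  group_2 = c(2, 3, 3, 2, 3, 3),
  p_value = c(0.737, 0.183, 0.295, 0.879, 0.0908, 0.100)
)

adjusted <- results %>%
  group_by(y) %>%                                          # adjust within each variable
  mutate(p_holm = p.adjust(p_value, method = "holm")) %>%  # Holm correction
  ungroup()

adjusted
```

Dropping the group_by(y) line would instead adjust across every comparison in the table, which is stricter and may be more appropriate with around 400 variables.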

Upvotes: 1

akrun

Reputation: 886938

We can reshape the data into 'long' format with pivot_longer, group by 'group', apply pairwise.t.test within each group, convert each result into a tibble with tidy (from broom), store it in a list column, and then unnest that column. Note that this compares var1 against var2 within each group.

library(dplyr)
library(tidyr)
library(broom)
df %>% 
    pivot_longer(cols = -group, names_to = 'grp') %>% 
    group_by(group) %>%
    summarise(out = list(tidy(pairwise.t.test(value, grp)))) %>% 
    unnest(c(out))

-output

# A tibble: 3 x 4
  group group1 group2  p.value
  <int> <chr>  <chr>     <dbl>
1     1 var2   var1   0.0760  
2     2 var2   var1   0.0233  
3     3 var2   var1   0.000244

Upvotes: 2
