Stefan
Stefan

Reputation: 895

Data organization for paired t-test function in R using formula representation

I try to understand how the paired t.test() function in R calculates the differences between subjects using the formula specification (y ~ x), that is when the data is ordered in the long format and (a) arranged by row number or (b) arranged by treatment level (see examples below).

For the paired t-test, differences between the same subjects have to be calculated first (followed by other calculations). So if I change the order, how does R figure out which subjects belong together? I looked up the source code for the t.test() function (see below) but from there I cannot figure this out. My intuition tells me that it must be something like "take the first occurrence of level 1 and subtract this from the first occurrence of level 2, then the second, third, etc.". And this seems to be confirmed by the examples below. When I shuffle the rows randomly, only then is the result slightly different for the t and p-value.

Question: Which functions in the source code in the t.test() function (see below) make the calculation of the differences between subjects correct, despite the order of the rows being different?

require(tidyverse)
#> Loading required package: tidyverse
set.seed(1)
## create some toy data to run paired t.test
d <-
  tibble(ID = 1:10 ,
         PRE = c(rnorm(10, mean = 4, sd = 2)),
         POST = c(rnorm(10, mean = 6, sd = 2)))

d
#> # A tibble: 10 x 3
#>       ID   PRE  POST
#>    <int> <dbl> <dbl>
#>  1     1  2.75  9.02
#>  2     2  4.37  6.78
#>  3     3  2.33  4.76
#>  4     4  7.19  1.57
#>  5     5  4.66  8.25
#>  6     6  2.36  5.91
#>  7     7  4.97  5.97
#>  8     8  5.48  7.89
#>  9     9  5.15  7.64
#> 10    10  3.39  7.19

## pivot long to use the formula representation in t.test
d_long <- d %>%
  pivot_longer(-ID, names_to = "treat", values_to = "values")
d_long
#> # A tibble: 20 x 3
#>       ID treat values
#>    <int> <chr>  <dbl>
#>  1     1 PRE     2.75
#>  2     1 POST    9.02
#>  3     2 PRE     4.37
#>  4     2 POST    6.78
#>  5     3 PRE     2.33
#>  6     3 POST    4.76
#>  7     4 PRE     7.19
#>  8     4 POST    1.57
#>  9     5 PRE     4.66
#> 10     5 POST    8.25
#> 11     6 PRE     2.36
#> 12     6 POST    5.91
#> 13     7 PRE     4.97
#> 14     7 POST    5.97
#> 15     8 PRE     5.48
#> 16     8 POST    7.89
#> 17     9 PRE     5.15
#> 18     9 POST    7.64
#> 19    10 PRE     3.39
#> 20    10 POST    7.19

## arrange by treat variable to change order
d_long_arranged <- d_long %>% 
  arrange(desc(treat))
d_long_arranged
#> # A tibble: 20 x 3
#>       ID treat values
#>    <int> <chr>  <dbl>
#>  1     1 PRE     2.75
#>  2     2 PRE     4.37
#>  3     3 PRE     2.33
#>  4     4 PRE     7.19
#>  5     5 PRE     4.66
#>  6     6 PRE     2.36
#>  7     7 PRE     4.97
#>  8     8 PRE     5.48
#>  9     9 PRE     5.15
#> 10    10 PRE     3.39
#> 11     1 POST    9.02
#> 12     2 POST    6.78
#> 13     3 POST    4.76
#> 14     4 POST    1.57
#> 15     5 POST    8.25
#> 16     6 POST    5.91
#> 17     7 POST    5.97
#> 18     8 POST    7.89
#> 19     9 POST    7.64
#> 20    10 POST    7.19

## randomly rearrange order of rows
d_long_random <- d_long %>% 
  slice(sample(1:n()))
d_long_random
#> # A tibble: 20 x 3
#>       ID treat values
#>    <int> <chr>  <dbl>
#>  1     5 POST    8.25
#>  2     3 POST    4.76
#>  3     8 PRE     5.48
#>  4     6 POST    5.91
#>  5     5 PRE     4.66
#>  6     4 PRE     7.19
#>  7    10 PRE     3.39
#>  8     8 POST    7.89
#>  9     4 POST    1.57
#> 10     7 PRE     4.97
#> 11     9 POST    7.64
#> 12     9 PRE     5.15
#> 13     7 POST    5.97
#> 14     1 POST    9.02
#> 15     2 PRE     4.37
#> 16    10 POST    7.19
#> 17     3 PRE     2.33
#> 18     2 POST    6.78
#> 19     6 PRE     2.36
#> 20     1 PRE     2.75

## Run t.tests
t.test(d$POST, d$PRE, paired = T)
#> 
#>  Paired t-test
#> 
#> data:  d$POST and d$PRE
#> t = 2.2879, df = 9, p-value = 0.04794
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.02508677 4.44148199
#> sample estimates:
#> mean of the differences 
#>                2.233284
t.test(values ~ treat, paired = T, d_long)
#> 
#>  Paired t-test
#> 
#> data:  values by treat
#> t = 2.2879, df = 9, p-value = 0.04794
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.02508677 4.44148199
#> sample estimates:
#> mean of the differences 
#>                2.233284
t.test(values ~ treat, paired = T, d_long_arranged)
#> 
#>  Paired t-test
#> 
#> data:  values by treat
#> t = 2.2879, df = 9, p-value = 0.04794
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.02508677 4.44148199
#> sample estimates:
#> mean of the differences 
#>                2.233284
t.test(values ~ treat, paired = T, d_long_random)
#> 
#>  Paired t-test
#> 
#> data:  values by treat
#> t = 2.3054, df = 9, p-value = 0.04658
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  0.04193459 4.42463416
#> sample estimates:
#> mean of the differences 
#>                2.233284

## pull up t.test.default source code
getAnywhere(t.test.default)
#> A single object matching 't.test.default' was found
#> It was found in the following places
#>   registered S3 method for t.test from namespace stats
#>   namespace:stats
#> with value
#> 
#> function (x, y = NULL, alternative = c("two.sided", "less", "greater"), 
#>     mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, 
#>     ...) 
#> {
#>     alternative <- match.arg(alternative)
#>     if (!missing(mu) && (length(mu) != 1 || is.na(mu))) 
#>         stop("'mu' must be a single number")
#>     if (!missing(conf.level) && (length(conf.level) != 1 || !is.finite(conf.level) || 
#>         conf.level < 0 || conf.level > 1)) 
#>         stop("'conf.level' must be a single number between 0 and 1")
#>     if (!is.null(y)) {
#>         dname <- paste(deparse(substitute(x)), "and", deparse(substitute(y)))
#>         if (paired) 
#>             xok <- yok <- complete.cases(x, y)
#>         else {
#>             yok <- !is.na(y)
#>             xok <- !is.na(x)
#>         }
#>         y <- y[yok]
#>     }
#>     else {
#>         dname <- deparse(substitute(x))
#>         if (paired) 
#>             stop("'y' is missing for paired test")
#>         xok <- !is.na(x)
#>         yok <- NULL
#>     }
#>     x <- x[xok]
#>     if (paired) {
#>         x <- x - y
#>         y <- NULL
#>     }
#>     nx <- length(x)
#>     mx <- mean(x)
#>     vx <- var(x)
#>     if (is.null(y)) {
#>         if (nx < 2) 
#>             stop("not enough 'x' observations")
#>         df <- nx - 1
#>         stderr <- sqrt(vx/nx)
#>         if (stderr < 10 * .Machine$double.eps * abs(mx)) 
#>             stop("data are essentially constant")
#>         tstat <- (mx - mu)/stderr
#>         method <- if (paired) 
#>             "Paired t-test"
#>         else "One Sample t-test"
#>         estimate <- setNames(mx, if (paired) 
#>             "mean of the differences"
#>         else "mean of x")
#>     }
#>     else {
#>         ny <- length(y)
#>         if (nx < 1 || (!var.equal && nx < 2)) 
#>             stop("not enough 'x' observations")
#>         if (ny < 1 || (!var.equal && ny < 2)) 
#>             stop("not enough 'y' observations")
#>         if (var.equal && nx + ny < 3) 
#>             stop("not enough observations")
#>         my <- mean(y)
#>         vy <- var(y)
#>         method <- paste(if (!var.equal) 
#>             "Welch", "Two Sample t-test")
#>         estimate <- c(mx, my)
#>         names(estimate) <- c("mean of x", "mean of y")
#>         if (var.equal) {
#>             df <- nx + ny - 2
#>             v <- 0
#>             if (nx > 1) 
#>                 v <- v + (nx - 1) * vx
#>             if (ny > 1) 
#>                 v <- v + (ny - 1) * vy
#>             v <- v/df
#>             stderr <- sqrt(v * (1/nx + 1/ny))
#>         }
#>         else {
#>             stderrx <- sqrt(vx/nx)
#>             stderry <- sqrt(vy/ny)
#>             stderr <- sqrt(stderrx^2 + stderry^2)
#>             df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny - 
#>                 1))
#>         }
#>         if (stderr < 10 * .Machine$double.eps * max(abs(mx), 
#>             abs(my))) 
#>             stop("data are essentially constant")
#>         tstat <- (mx - my - mu)/stderr
#>     }
#>     if (alternative == "less") {
#>         pval <- pt(tstat, df)
#>         cint <- c(-Inf, tstat + qt(conf.level, df))
#>     }
#>     else if (alternative == "greater") {
#>         pval <- pt(tstat, df, lower.tail = FALSE)
#>         cint <- c(tstat - qt(conf.level, df), Inf)
#>     }
#>     else {
#>         pval <- 2 * pt(-abs(tstat), df)
#>         alpha <- 1 - conf.level
#>         cint <- qt(1 - alpha/2, df)
#>         cint <- tstat + c(-cint, cint)
#>     }
#>     cint <- mu + cint * stderr
#>     names(tstat) <- "t"
#>     names(df) <- "df"
#>     names(mu) <- if (paired || !is.null(y)) 
#>         "difference in means"
#>     else "mean"
#>     attr(cint, "conf.level") <- conf.level
#>     rval <- list(statistic = tstat, parameter = df, p.value = pval, 
#>         conf.int = cint, estimate = estimate, null.value = mu, 
#>         stderr = stderr, alternative = alternative, method = method, 
#>         data.name = dname)
#>     class(rval) <- "htest"
#>     rval
#> }
#> <bytecode: 0x0000000017f32b58>
#> <environment: namespace:stats>

## pull up t.test.formula source code
getAnywhere(t.test.formula)
#> A single object matching 't.test.formula' was found
#> It was found in the following places
#>   registered S3 method for t.test from namespace stats
#>   namespace:stats
#> with value
#> 
#> function (formula, data, subset, na.action, ...) 
#> {
#>     if (missing(formula) || (length(formula) != 3L) || (length(attr(terms(formula[-2L]), 
#>         "term.labels")) != 1L)) 
#>         stop("'formula' missing or incorrect")
#>     m <- match.call(expand.dots = FALSE)
#>     if (is.matrix(eval(m$data, parent.frame()))) 
#>         m$data <- as.data.frame(data)
#>     m[[1L]] <- quote(stats::model.frame)
#>     m$... <- NULL
#>     mf <- eval(m, parent.frame())
#>     DNAME <- paste(names(mf), collapse = " by ")
#>     names(mf) <- NULL
#>     response <- attr(attr(mf, "terms"), "response")
#>     g <- factor(mf[[-response]])
#>     if (nlevels(g) != 2L) 
#>         stop("grouping factor must have exactly 2 levels")
#>     DATA <- setNames(split(mf[[response]], g), c("x", "y"))
#>     y <- do.call("t.test", c(DATA, list(...)))
#>     y$data.name <- DNAME
#>     if (length(y$estimate) == 2L) 
#>         names(y$estimate) <- paste("mean in group", levels(g))
#>     y
#> }
#> <bytecode: 0x000000001807a9a0>
#> <environment: namespace:stats>

Created on 2020-01-26 by the reprex package (v0.3.0)

Upvotes: 1

Views: 1269

Answers (1)

Allan Cameron
Allan Cameron

Reputation: 174478

The cleverness is mathematical, not related to the software implementation.

Remember, all a paired t-test does to get the mean differences is to sum the differences between column one and column 2 (think sum(post[1] - pre[1], post[2] - pre[2], post[3] - pre[3] ... etc), and divide by the length of the data frame (nrow(d)). The sum of differences can be rearranged to sum(post[1:10] - pre[1:10]), which itself can be rearranged to sum(post) - sum(pre). So the sum of the differences is equal to the differences of the sums. Of course, nrow(d) stays the same, so the mean of differences will always be the same, regardless of order.

You'll notice that the p value and t scores are slightly different when you change the order. That's because the distribution of differences will be different when you shuffle the observations. However, the sum of differences (and hence their mean) remains the same.

Upvotes: 3

Related Questions