obizues

Reputation: 1483

How to calculate a paired t-test for one column in a data frame to all other columns in a single statement using R

I have a data frame with about 20 different columns of data. The first column has two options: the result being true or false.

I want to do a paired t.test between the first column and the rest for a total of 19 tests, with the goal of ranking how well those other 19 columns can predict a true value.

I’m hoping there is a way to essentially loop through the columns while keeping the first column the whole time.

My current attempt (the code below) iterates through adjacent columns from left to right, testing A&B, B&C, C&D, etc. It does not keep the first column (A) fixed while advancing the second column.

Code:

tests = lapply(seq(1,(length(df)-1)),function(x){t.test(df[,x],df[,x+1])}) 

Instead what I want is: A&B, A&C, A&D, etc.

Upvotes: 0

Views: 2374

Answers (2)

dcarlson

Reputation: 11056

As the comments note, this is a two-sample t-test, not a paired t-test, unless you add paired=TRUE. It does, however, fix the first column and run through the rest:

tests <- lapply(seq(2, length(df)), function(x){t.test(df[,1], df[,x])})

If you are using the first column to define two groups, then it would be as follows:

tests <- lapply(seq(2, length(df)), function(x){t.test(df[,x]~df[,1])})

This would be a two-sample t-test with each column split into two groups defined by column 1.
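Since the stated goal is to rank the columns, one possible follow-up is to pull the p-value out of each test and sort. This is only a rough sketch: the data frame and its column names (group, a, b, c) are made up for illustration, and ordering by p-value (with a Holm adjustment for running several tests) is one assumption about what "ranking" should mean, not the only reasonable choice.

```r
# Hypothetical example data: a logical grouping column first, numeric columns after.
set.seed(1)
df <- data.frame(
  group = rep(c(TRUE, FALSE), each = 25),
  a = rnorm(50),
  b = rnorm(50),
  c = rnorm(50) + rep(c(2, 0), each = 25)  # shifted upward in the TRUE group
)

# One Welch two-sample t-test per numeric column, split by the first column.
tests <- lapply(seq(2, length(df)), function(x) t.test(df[, x] ~ df[, 1]))
names(tests) <- names(df)[-1]

# Extract the p-values and sort them ascending (smallest p-value first).
p_values <- sapply(tests, function(tt) tt$p.value)
ranking <- sort(p_values)

# Optional: adjust for multiple testing (Holm's method) before interpreting.
adjusted <- p.adjust(ranking, method = "holm")
print(ranking)
print(adjusted)
```

A small p-value only says the group means differ; it is not a direct measure of predictive ability, so treat the ordering as a first screen.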

Upvotes: 0

Edward

Reputation: 18798

I'm wondering if you really want an unpaired t-test instead. You described the first column as being TRUE or FALSE and said your goal was to see how well the other columns could predict a TRUE value, in other words, whether the means of the 19 other columns are significantly different between the TRUE and FALSE groups. A paired t-test does not fit data in that format, unless you want to compare x2 with x3, or x3 with x4, and so on. In that case you'd use the following:

t.test(df$x2, df$x3, paired=TRUE)
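To make concrete what "paired" means here, the sketch below checks that a paired t-test gives the same result as a one-sample t-test on the row-wise differences. The vectors before and after are hypothetical stand-ins for two numeric columns measured on the same subjects; they are not part of the question's data.

```r
# Hypothetical paired measurements on the same 30 subjects.
set.seed(42)
before <- rnorm(30, mean = 50, sd = 10)
after  <- before + rnorm(30, mean = 1, sd = 3)

# Paired t-test on the two vectors.
paired_result <- t.test(before, after, paired = TRUE)

# One-sample t-test on the differences, testing against a mean of 0.
diff_result <- t.test(before - after, mu = 0)

# Both approaches give the same test statistic and p-value.
same_statistic <- isTRUE(all.equal(unname(paired_result$statistic),
                                   unname(diff_result$statistic)))
same_p_value <- isTRUE(all.equal(paired_result$p.value, diff_result$p.value))
print(c(same_statistic = same_statistic, same_p_value = same_p_value))
```

This is why a paired test needs two numeric columns with matching rows: a TRUE/FALSE column has no meaningful row-wise difference with a numeric one.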

An unpaired t-test on the second column, with the first column as the grouping variable, can be performed using the formula method. For example, to compare the means of the second variable between the TRUE and FALSE groups, you can do:

t.test(x1 ~ group, data=df)

This is an unpaired, two-sample t-test. The same test can also be written in two other ways, for reasons which will become evident later.

t.test(df$x1 ~ df$group)
t.test(df[,2] ~ df[,1])

The latter version allows you to then perform repeated tests using the lapply function as mentioned.

tests <- lapply(2:20, function(x) t.test(df[,x] ~ df[,1]))

This returns an unnamed list, which can be named using the column names of the data frame.

names(tests) <- names(df)[2:20]
tests[1]

$x1

    Welch Two Sample t-test

data:  df[, x] by df[, 1]
t = -0.83536, df = 94.695, p-value = 0.4056
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.339658  2.176944
sample estimates:
mean in group FALSE  mean in group TRUE 
           48.46547            50.04683

You can also tidy this using the broom package.

lapply(tests,  broom::tidy)

$x1
# A tibble: 1 x 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method      alternative
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>       <chr>      
1    -1.58      48.5      50.0    -0.835   0.406      94.7    -5.34      2.18 Welch Two ~ two.sided  
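If a single table is more convenient than a list of one-row tibbles, the tidied results can be stacked. The sketch below assumes the dplyr and broom packages are installed and uses a smaller made-up data frame (three x columns instead of 19); the id column name "column" is an arbitrary choice.

```r
library(dplyr)
library(broom)

# Hypothetical data in the same layout: logical group first, numeric columns after.
set.seed(123)
df <- data.frame(
  group = sample(c(TRUE, FALSE), size = 100, replace = TRUE),
  x1 = rnorm(100, mean = 50, sd = 10),
  x2 = rnorm(100, mean = 50, sd = 10),
  x3 = rnorm(100, mean = 50, sd = 10)
)

# One Welch two-sample t-test per numeric column, named by column.
tests <- lapply(2:ncol(df), function(x) t.test(df[, x] ~ df[, 1]))
names(tests) <- names(df)[-1]

# Tidy each test and stack the rows; the list names become the "column" variable.
results <- bind_rows(lapply(tests, tidy), .id = "column")
print(results)
```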

The dplyr version would be to use the do function instead of lapply, but first the data frame needs to be tidied into a long format.

library(dplyr)
library(tidyr)
library(broom)  # provides tidy()

df %>% pivot_longer(cols=starts_with("x")) %>%
  group_by(name) %>%
  do(tidy(t.test(.$value ~ .$group)))

# A tibble: 19 x 11
# Groups:   name [19]
   name  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
 1 x1     -1.58        48.5      50.0   -0.835   0.406       94.7   -5.34      2.18 
 2 x10    -0.377       49.3      49.6   -0.194   0.847       95.1   -4.24      3.49 
 3 x11     4.49        53.1      48.6    2.08    0.0400      97.8    0.209     8.77 
 4 x12    -1.05        51.1      52.2   -0.450   0.654       88.9   -5.70      3.59 
 5 x13    -0.743       49.4      50.1   -0.360   0.720       96.8   -4.84      3.35 
 6 x14     0.908       51.5      50.6    0.487   0.627       93.3   -2.79      4.61 
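In current dplyr releases do() is marked as superseded, so the same long-format approach might instead be written with group_modify(). This is a sketch rather than a drop-in replacement: it assumes dplyr, tidyr and broom are installed, and it uses a small made-up data frame with three x columns.

```r
library(dplyr)
library(tidyr)
library(broom)

# Hypothetical data in the same layout: logical group first, numeric columns after.
set.seed(123)
df <- data.frame(
  group = sample(c(TRUE, FALSE), size = 100, replace = TRUE),
  x1 = rnorm(100, mean = 50, sd = 10),
  x2 = rnorm(100, mean = 50, sd = 10),
  x3 = rnorm(100, mean = 50, sd = 10)
)

# Reshape to long format, then run one t-test per original column.
# Inside group_modify(), .x is the subset of rows for the current group.
results <- df %>%
  pivot_longer(cols = starts_with("x")) %>%
  group_by(name) %>%
  group_modify(~ tidy(t.test(value ~ group, data = .x))) %>%
  ungroup()

print(results)
```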

Data:

set.seed(123)
n <- 100; m <- 19  # number of subjects (rows) and number of "x" columns
X <- data.frame(matrix(rnorm(n*m, mean=50, sd=10), byrow=TRUE, ncol=m))
colnames(X) <- paste0("x", 1:m)
df <- data.frame(group=sample(c(TRUE, FALSE), size=n, replace=TRUE), X)
str(df)

'data.frame':   100 obs. of  20 variables:
 $ group: logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ x1   : num  44.4 45.3 46.9 55.8 47.2 ...
 $ x2   : num  47.7 39.3 46.2 51.2 37.8 ...
 $ x3   : num  65.6 47.8 43.1 52.2 51.8 ...
 $ x4   : num  50.7 39.7 47.9 53.8 48.6 ...
 $ x5   : num  51.3 42.7 37.3 45 50.1 ...
 $ x6   : num  67.2 43.7 71.7 46.7 53.9 ...

Upvotes: 1
