Reputation: 1483
I have a data frame with about 20 different columns of data. The first column takes one of two values: the result is either true or false.
I want to do a paired t.test between the first column and the rest for a total of 19 tests, with the goal of ranking how well those other 19 columns can predict a true value.
I’m hoping there is a way to essentially loop through the columns while keeping the first column the whole time.
The code below iterates through the columns left to right, but it does not keep the first column (A) fixed while advancing the second. It tests A&B, B&C, C&D, etc.
Code:
tests = lapply(seq(1,(length(df)-1)),function(x){t.test(df[,x],df[,x+1])})
Instead what I want is: A&B, A&C, A&D, etc.
Upvotes: 0
Views: 2374
Reputation: 11056
As the comments note, this is a two-sample t-test, not a paired t-test, unless you add paired=TRUE, but it fixes the first column and runs through the rest:
tests <- lapply(seq(2, length(df)), function(x){t.test(df[,1], df[,x])})
If you are using the first column to define two groups, then it would be as follows:
tests <- lapply(seq(2, length(df)), function(x){t.test(df[,x]~df[,1])})
This would be a two-sample t-test with each column split into two groups defined by column 1.
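Since the stated goal is to rank the other columns, one possible follow-up is to pull the p-value out of each test and sort them. The sketch below is only an illustration: it uses a small made-up data frame (`toy`, with hypothetical columns `group`, `a`, `b`, `c`) rather than the asker's data, and sorting by p-value is just one way to rank.

```r
# Hypothetical toy data: a logical first column and three numeric columns.
set.seed(1)
toy <- data.frame(
  group = rep(c(TRUE, FALSE), each = 20),
  a = rnorm(40),
  b = rnorm(40),
  c = rnorm(40) + rep(c(2, 0), each = 20)  # only 'c' is shifted in the TRUE group
)

# One two-sample (Welch) t-test per remaining column, split by the first column.
toy_tests <- lapply(seq(2, length(toy)), function(x) t.test(toy[, x] ~ toy[, 1]))
names(toy_tests) <- names(toy)[-1]

# Extract the p-values and sort ascending (smallest p-value first).
p_values <- sapply(toy_tests, function(tt) tt$p.value)
ranking <- sort(p_values)
print(ranking)
```

A small p-value only says the group means differ; it is not by itself a measure of predictive accuracy, so treat the ordering as a rough screen.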
Upvotes: 0
Reputation: 18798
I'm wondering if you really want to do an unpaired t-test. The reason I say this is that you described the first column as being TRUE or FALSE and then said your goal was to see how well the other columns could predict a TRUE value; in other words, whether the means of the 19 other columns are significantly different between the TRUE and FALSE groups. If you really wanted to do a paired t-test, then your data, as described, is not quite in the correct format, unless you want to compare x2 and x3, or x3 and x4, etc. In that case you'd use the following:
t.test(df$x2, df$x3, paired=TRUE)
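To illustrate why a paired test needs two numeric measurements on the same rows, here is a minimal sketch with made-up vectors (`before` and `after` are hypothetical names, not columns from the question). A paired t-test is equivalent to a one-sample t-test on the row-wise differences, which is why a TRUE/FALSE column does not make sense as one half of the pair.

```r
# Hypothetical paired measurements on the same 30 subjects.
set.seed(42)
before <- rnorm(30, mean = 50, sd = 10)
after  <- before + rnorm(30, mean = 1, sd = 5)

# Paired t-test on the two numeric vectors.
paired_res <- t.test(before, after, paired = TRUE)

# One-sample t-test on the row-wise differences.
diff_res <- t.test(before - after, mu = 0)

# Both approaches give the same t statistic and p-value.
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
all.equal(paired_res$p.value, diff_res$p.value)
```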
Performing an unpaired t-test on the second column with the first column as the group variable is achieved using the formula method. For example, to compare the means of the second variable between the TRUE and FALSE groups, you can do:
t.test(x1 ~ group, data=df)
And this is an unpaired, two-sample t-test. It can also be done slightly differently for reasons which will become evident later.
t.test(df$x1 ~ df$group)
t.test(df[,2] ~ df[,1])
The latter version allows you to then perform repeated tests using the lapply function as mentioned.
tests <- lapply(2:20, function(x) t.test(df[,x] ~ df[,1]))
This returns an un-named list, which can be named using the names of the data frame.
names(tests) <- names(df)[2:20]
tests[1]
$x1
Welch Two Sample t-test
data: df[, x] by df[, 1]
t = -0.83536, df = 94.695, p-value = 0.4056
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.339658 2.176944
sample estimates:
mean in group FALSE  mean in group TRUE
           48.46547            50.04683
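Note that every result in this list prints the same `data:` line, because the formula is built from column positions. As a possible variation (a sketch, not part of the original approach), the formulas can be built from the column names with `reformulate()`, so that each result is labelled with its own column. The block recreates the simulated `df` from the Data section at the end of this answer so that it runs on its own; `vars` and `tests_named` are names chosen for illustration.

```r
# Recreate the simulated data from the "Data" section.
set.seed(123)
n <- 100; m <- 19
X <- data.frame(matrix(rnorm(n * m, mean = 50, sd = 10), byrow = TRUE, ncol = m))
colnames(X) <- paste0("x", 1:m)
df <- data.frame(group = sample(c(TRUE, FALSE), size = n, replace = TRUE), X)

# Build the formulas x1 ~ group, x2 ~ group, ... from the column names.
vars <- names(df)[-1]
tests_named <- lapply(vars, function(v) {
  f <- reformulate("group", response = v)
  t.test(f, data = df)
})
names(tests_named) <- vars

# Each result now carries its own label in the 'data:' line.
tests_named$x1$data.name
```

Using `names(df)[-1]` instead of a hard-coded `2:20` also means the loop keeps working if columns are added or removed.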
You can also tidy this using the broom package.
lapply(tests, broom::tidy)
$x1
# A tibble: 1 x 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method      alternative
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>       <chr>
1    -1.58      48.5      50.0    -0.835   0.406      94.7    -5.34      2.18 Welch Two ~ two.sided
The dplyr version would be to use the do function instead of lapply, but first the data frame needs to be tidied into a long format.
library(dplyr)
library(tidyr)
library(broom)  # provides tidy()
df %>% pivot_longer(cols=starts_with("x")) %>%
  group_by(name) %>%
  do(tidy(t.test(.$value ~ .$group)))
# A tibble: 19 x 11
# Groups: name [19]
  name  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
1 x1       -1.58      48.5      50.0    -0.835   0.406      94.7    -5.34      2.18
2 x10     -0.377      49.3      49.6    -0.194   0.847      95.1    -4.24      3.49
3 x11       4.49      53.1      48.6      2.08  0.0400      97.8    0.209      8.77
4 x12      -1.05      51.1      52.2    -0.450   0.654      88.9    -5.70      3.59
5 x13     -0.743      49.4      50.1    -0.360   0.720      96.8    -4.84      3.35
6 x14      0.908      51.5      50.6     0.487   0.627      93.3    -2.79      4.61
Data:
set.seed(123)
n <- 100; m <- 19 # number of subjects (rows) and number of "x" columns
X <- data.frame(matrix(rnorm(n*m, mean=50, sd=10), byrow=TRUE, ncol=m))
colnames(X) <- paste0("x", 1:m)
df <- data.frame(group=sample(c(TRUE, FALSE), size=n, replace=TRUE), X)
str(df)
'data.frame': 100 obs. of 20 variables:
$ group: logi FALSE FALSE FALSE FALSE TRUE FALSE ...
$ x1 : num 44.4 45.3 46.9 55.8 47.2 ...
$ x2 : num 47.7 39.3 46.2 51.2 37.8 ...
$ x3 : num 65.6 47.8 43.1 52.2 51.8 ...
$ x4 : num 50.7 39.7 47.9 53.8 48.6 ...
$ x5 : num 51.3 42.7 37.3 45 50.1 ...
$ x6 : num 67.2 43.7 71.7 46.7 53.9 ...
Upvotes: 1