paxton

Reputation: 91

Performing a large number of 2-sample t-tests in R

I am creating a function that takes a data.frame and returns a data.frame of p-values, one for each variable tested.

# example data and labels
my_data <- data.frame(matrix(data = rnorm(100 * 10000), nrow = 100, ncol = 10000))
labels <- sample(0:1, 100, replace = TRUE)

# append the labels to the data, then filter into the two groups
my_data$labels <- labels

sample_1 <- dplyr::filter(.data = my_data, labels == 0)
sample_2 <- dplyr::filter(.data = my_data, labels == 1)

# perform a t-test on each column and collect the p-values
p_vals <- data.frame()
for (i in 1:10000) {
    p_vals <- rbind(p_vals, t.test(x = sample_1[, i], y = sample_2[, i])$p.value)
}

p_vals

This is functional, but I think/hope there is a more efficient way to do it without the for loop. The results need to stay in rows, one per variable, because later functions must keep track of which variable produced which p-value.

Upvotes: 1

Views: 280

Answers (2)

StupidWolf

Reputation: 46978

You can also use the package genefilter:

library(genefilter)
colttests(as.matrix(my_data[, -ncol(my_data)]), factor(my_data$labels))
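
Note that genefilter is a Bioconductor package rather than a CRAN one, so if it is not already installed it would typically be installed via BiocManager; a minimal sketch:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("genefilter")

colttests() should return a data.frame with one row per tested column (statistic, dm, and p.value), with row names taken from the column names, so the variable-to-p-value pairing the question needs is preserved.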

Upvotes: 1

George Savva

Reputation: 5336

Instead of splitting the samples, you can use the formula interface to t.test and sapply over the columns of my_data to conduct the tests:

p_vals <- sapply(my_data, function(x) t.test(x ~ labels)$p.value)

This will produce a named vector of p-values, in the same order as the columns of my_data.
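
If you need the result as a data.frame keyed by variable name (as the question requires for later steps), one option is to convert the named vector. A minimal sketch, assuming my_data here contains only the measurement columns (i.e. the appended labels column has been dropped), with p_df as a hypothetical name for the result:

# run the per-column tests, then keep the variable/p-value pairing explicit
p_vals <- sapply(my_data, function(x) t.test(x ~ labels)$p.value)
p_df <- data.frame(variable = names(p_vals), p.value = unname(p_vals))
head(p_df)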

Upvotes: 2
