paxton

Reputation: 91

Performing a large number of 2-sample t-tests in R

I am creating a function that takes a data.frame and returns a data.frame of p-values, one for each variable tested.

# example data and labels
my_data <- data.frame(matrix(data = rnorm(100 * 10000), nrow = 100, ncol = 10000))
labels <- sample(0:1, 100, replace = TRUE)

# append the labels to the data, then filter into the two groups
my_data$labels <- labels

sample_1 <- dplyr::filter(.data = my_data, labels == 0)
sample_2 <- dplyr::filter(.data = my_data, labels == 1)

# perform a t-test on each column and collect the p-values
p_vals <- data.frame()
for (i in 1:10000) {
    p_vals <- rbind(p_vals, t.test(x = sample_1[, i], y = sample_2[, i])$p.value)
}

p_vals

This is functional, but I think/hope there is a more efficient way to do it without the for loop. The results need to stay in rows, one per variable, because later functions must keep track of which variable produced which p-value.

Upvotes: 1

Views: 280

Answers (2)

StupidWolf

Reputation: 46978

You can also use the package genefilter:

library(genefilter)
colttests(as.matrix(my_data[, -ncol(my_data)]), factor(my_data$labels))
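
Note that genefilter is a Bioconductor package rather than a CRAN one, so if it is not already installed it would typically be installed via BiocManager; a minimal sketch:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("genefilter")

colttests() should return a data.frame with one row per tested column (statistic, dm, and p.value), with row names taken from the column names, so the variable-to-p-value pairing the question needs is preserved.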

Upvotes: 1

George Savva

Reputation: 5336

Instead of splitting the samples, you can use the formula interface to t.test and sapply over the columns of my_data to conduct the tests:

p_vals <- sapply(my_data, function(x) t.test(x ~ labels)$p.value)

This will produce a named vector of p-values, in the same order as the columns of my_data.
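
If you need the result as a data.frame keyed by variable name (as the question requires for later steps), one option is to convert the named vector. A minimal sketch, assuming my_data here contains only the measurement columns (i.e. the appended labels column has been dropped), with p_df as a hypothetical name for the result:

# run the per-column tests, then keep the variable/p-value pairing explicit
p_vals <- sapply(my_data, function(x) t.test(x ~ labels)$p.value)
p_df <- data.frame(variable = names(p_vals), p.value = unname(p_vals))
head(p_df)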

Upvotes: 2
