Reputation: 91
I am writing a function that takes a data.frame and returns a data frame of p-values, one for each variable tested.
# data and labels
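# draw one value per cell; rnorm(10000) would be silently recycled across the 100 x 10000 matrix
my_data <- data.frame(matrix(data = rnorm(100 * 10000), nrow = 100, ncol = 10000))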
labels <- sample(0:1, 100, replace = TRUE)
# append the labels to the data, then filter
my_data$labels <- labels
sample_1 <- dplyr::filter(.data = my_data, labels == 0)
sample_2 <- dplyr::filter(.data = my_data, labels == 1)
# perform a t-test on each column
p_vals <- data.frame()
for (i in 1:10000) {
  p_vals <- rbind(p_vals, t.test(x = sample_1[, i], y = sample_2[, i])$p.value)
}
return(p_vals)
This works, but I hope there is a more efficient way to do it without the for loop. The results should be in rows, one per variable, because later functions need to keep track of which variable had which p-value.
Upvotes: 1
Views: 280
Reputation: 46978
You can also use the genefilter package (from Bioconductor):
library(genefilter)
colttests(as.matrix(my_data[,-ncol(my_data)]),factor(my_data$labels))
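As a rough sketch of what this computes: according to the genefilter documentation, colttests runs a two-sided, equal-variance t-test on each column and returns a data frame with statistic, dm and p.value columns, whereas t.test defaults to the Welch (unequal-variance) test. The small matrix x and factor grp below are made-up stand-ins for the real data, and the base-R line is only meant as a comparison under that assumption:

```r
# Hypothetical stand-in data: 20 samples, 5 variables, two groups of 10
set.seed(1)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5,
            dimnames = list(NULL, paste0("X", 1:5)))
grp <- factor(rep(0:1, each = 10))

# Base R: equal-variance t-test on each column
p_equal <- apply(x, 2, function(col) t.test(col ~ grp, var.equal = TRUE)$p.value)

# If genefilter is installed, compare against colttests
if (requireNamespace("genefilter", quietly = TRUE)) {
  res <- genefilter::colttests(x, grp)
  print(all.equal(unname(p_equal), res$p.value))
}
```

If you need p-values that match the default t.test output, keep in mind that the two approaches use different variance assumptions.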
Upvotes: 1
Reputation: 5336
Instead of splitting the samples, you can use the formula interface to t.test and sapply over the columns of my_data to conduct the tests:
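# skip the labels column (if it was appended to my_data) so it is not tested against itself
p_vals <- sapply(my_data[setdiff(names(my_data), "labels")], function(x) t.test(x ~ labels)$p.value)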
This returns a named vector of p-values, in the same order as the columns of my_data.
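If the result needs to be in rows with the variable name next to each p-value, one possible way is to turn the named vector into a two-column data frame. This is only a sketch: small_data, small_labels and p_table are made-up names, the data is a small stand-in, and vapply is used here in place of sapply simply to guarantee a numeric result:

```r
# Hypothetical stand-in data: 20 samples, 4 variables (columns X1..X4)
set.seed(42)
small_data <- data.frame(matrix(rnorm(20 * 4), nrow = 20, ncol = 4))
small_labels <- rep(0:1, each = 10)

# One Welch t-test per column; vapply enforces one numeric value per column
p_vals <- vapply(small_data,
                 function(x) t.test(x ~ small_labels)$p.value,
                 numeric(1))

# One row per variable, keeping the column name alongside its p-value
p_table <- data.frame(variable = names(p_vals), p.value = unname(p_vals))
print(p_table)
```

Because the names come from the columns of the input, each row stays tied to its variable even if the table is later sorted or filtered.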
Upvotes: 2