doggysaywhat

Reputation: 25

Perform t-test using aggregate function in R

I'm having difficulty using the unpaired t-test and the aggregate function.

Example

dd<-data.frame(names=c("1st","1st","1st","1st","2nd","2nd","2nd","2nd"),a=c(11,12,13,14,2.1,2.2,2.3,2.4),b=c(3.1,3.2,3.3,3.4,3.1,3.2,3.3,3.4))
dd
#  Compare all the values in the "a" column that match with "1st" against the values in the "b" column that match "1st".  
#  Then, do the same thing with those matching "2nd"

t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(2.1,2.2,2.3,2.4),c(3.1,3.2,3.3,3.4))$p.value

#  Also need to replace any errors from t.test that have too low variance with NA
#  An example of the type of error I might run into would be if the "b" column was replaced with c(3,3,3,3,3,3,3,3).  

For paired data, I found a workaround.

#  Create Paired data.
data_paired<-dd[,3]-dd[,2]

#  Create new t-test so that it doesn't crash upon the first instance of an error.  
my_t.test<-function(x){
    A<-try(t.test(x), silent=TRUE)
    if (is(A, "try-error")) return(NA) else return(A$p.value)
}

#  Use aggregate with new t-test.  
aggregate(data_paired, by=list(dd$names),FUN=my_t.test)

This aggregate call works with a single column of input. However, I can't get it to work when several columns have to go into the function.

Example:

my_t.test2<-function(x,y){
    A<-try(t.test(x,y,paired=FALSE), silent=TRUE)
    if (is(A, "try-error")) return(NA) else return(A$p.value)
}

aggregate(dd[,c(2,3)],by=list(dd$names),function(x,y) my_t.test2(dd[,3],dd[,2]))

I had thought that the aggregate function would send only the rows matching each value in the list to my_t.test2 and then move on to the next list element. However, the results indicate that it performs a t-test on all values in the columns, as below, and then places that same value in every cell of the results.

t.test(dd[,3],dd[,2])$p.value

What am I missing? Is this an issue with the original my_t.test2, an issue with how I structured the aggregate call, or something else? The way I applied it doesn't seem to aggregate.

These are the results I want.

t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(2.1,2.2,2.3,2.4),c(3.1,3.2,3.3,3.4))$p.value

Note that this is a toy example; the actual data set will have well over 100,000 entries that need to be grouped by the value in the names column. That is why I need the aggregate function.

Thanks for the help.

Upvotes: 1

Views: 4039

Answers (2)

shadow

Reputation: 22343

As @MrFlick said, aggregate is not the right function for this. Here are some alternatives to the sapply approach, using the dplyr or data.table packages.

require(dplyr)
summarize(group_by(dd, names), t.test(a,b)$p.value)

require(data.table)
data.table(dd)[, t.test(a,b)$p.value, by=names] 

Upvotes: 2

MrFlick

Reputation: 206536

aggregate isn't the right function to use here because the summary function only works on one column at a time. It's not possible to get both the a and b values simultaneously with this method.
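
To see the one-column-at-a-time behaviour for yourself, here is a small sketch (the data frame is the toy `dd` from the question; `res` is just a name chosen for illustration). The function passed as FUN reports what it receives: a plain numeric vector of length 4 for each group and each column, never both columns together.

```r
# Toy data from the question
dd <- data.frame(
  names = c("1st", "1st", "1st", "1st", "2nd", "2nd", "2nd", "2nd"),
  a = c(11, 12, 13, 14, 2.1, 2.2, 2.3, 2.4),
  b = c(3.1, 3.2, 3.3, 3.4, 3.1, 3.2, 3.3, 3.4)
)

# FUN is called separately for each group/column combination,
# so x is always a single vector, not a two-column data.frame
res <- aggregate(
  dd[, c("a", "b")],
  by = list(names = dd$names),
  FUN = function(x) paste(class(x), length(x))
)
print(res)
```

This also suggests why the call in the question returned the same value everywhere: the anonymous function ignored its argument and used the full dd[,3] and dd[,2] columns on every call.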

Another way you could approach the problem is to split the data, then apply the t-test to each subset. Here's one implementation:

sapply(
    split(dd[-1], dd$names), 
    function(x) t.test(x[["a"]], x[["b"]])$p.value
)

Here I split dd into a list of subsets, one for each value of names. I use dd[-1] to drop the "names" column from the subsets so that each is just a data.frame with two columns: one for a and one for b.
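
If it helps to see the intermediate step, this sketch inspects what split() returns for the toy data from the question (`groups` is a name picked here for illustration):

```r
# Toy data from the question
dd <- data.frame(
  names = c("1st", "1st", "1st", "1st", "2nd", "2nd", "2nd", "2nd"),
  a = c(11, 12, 13, 14, 2.1, 2.2, 2.3, 2.4),
  b = c(3.1, 3.2, 3.3, 3.4, 3.1, 3.2, 3.3, 3.4)
)

# One list element per distinct value of dd$names
groups <- split(dd[-1], dd$names)

names(groups)      # the group labels
groups[["2nd"]]    # the rows for "2nd", with columns a and b only
```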

Then, for each subset in the list, I perform a t.test using the a and b columns and extract the p-value. The sapply wrapper will calculate this p-value for each subset and return a named vector of p-values, where the names of the entries correspond to the levels of dd$names:

         1st          2nd 
6.727462e-04 3.436403e-05 

If you wanted to do a paired t-test this way, you could do

sapply(
    split(dd[-1], dd$names), 
    function(x) t.test(x[["a"]] - x[["b"]])$p.value
)
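
The question also asked for NA whenever t.test fails, for example on data that are essentially constant. One possible way to combine that with the split approach is sketched below. The helper name `safe_p` and the extra "3rd" group (constant in both columns, added only to trigger the error) are assumptions made for this illustration, not part of the original data.

```r
# Toy data from the question plus a hypothetical "3rd" group
# in which both columns are constant, so t.test() stops with an error
dd <- data.frame(
  names = rep(c("1st", "2nd", "3rd"), each = 4),
  a = c(11, 12, 13, 14, 2.1, 2.2, 2.3, 2.4, 5, 5, 5, 5),
  b = c(3.1, 3.2, 3.3, 3.4, 3.1, 3.2, 3.3, 3.4, 3, 3, 3, 3)
)

# Return the p-value, or NA if t.test() signals an error
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# vapply enforces that every group yields exactly one numeric value
p_values <- vapply(
  split(dd[c("a", "b")], dd$names),
  function(g) safe_p(g$a, g$b),
  numeric(1)
)
print(p_values)
```

The "1st" and "2nd" entries should match the p-values shown above, and "3rd" comes back as NA instead of stopping the whole computation.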

Upvotes: 2
