Gopala

Reputation: 10483

R: Doing t-test between pairs of factors

I have an R data frame with an ordered factor variable with 8 levels. I want to do a t-test between levels 1 & 2, 3 & 4, 5 & 6, and 7 & 8. While I can subset the data to extract each pair of levels, I am wondering if there is an easier way to do it, and I can't figure one out. I tried the following, but it complains about differing lengths (each level has a different number of observations):

t.test(var1 ~ levels(factorvar)[1:2], data = mydf)

Upvotes: 2

Views: 8433

Answers (2)

IRTFM

Reputation: 263451

pairs <- list( c(1, 2), c(3, 4), c(5, 6), c(7, 8) )
# Run one t-test per pair, keeping only the rows whose level is in that pair
lapply(pairs, function(pr) {
       t.test( var1 ~ factorvar, 
               data=dat[dat$factorvar %in% pr, c("var1", "factorvar")] )
                          }
       )

I don't think the extra (unrepresented) levels should cause problems with t.test.formula since the factors would get coerced to numeric. You could also try:

lapply(pairs, function(pr) {
         t.test( var1 ~ factorvar, 
                 data=dat[ , c("var1", "factorvar")],
                 subset= factorvar %in% pr)
                          } )

Note: Tested with:

dat <- data.frame(var1=rnorm(100), 
                  factorvar=factor(sample(1:8, 100, replace=TRUE)))

Sample output:

[[1]]

    Welch Two Sample t-test

data:  var1 by factorvar
t = -1.2077, df = 8.419, p-value = 0.26
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.3597432  0.4197142
sample estimates:
mean in group 1 mean in group 2 
     -0.1819342       0.2880802 


[[2]]

    Welch Two Sample t-test

data:  var1 by factorvar
t = -0.8141, df = 20.676, p-value = 0.4249
#--------- rest of output snipped-------

Upvotes: 2

eipi10

Reputation: 93861

I think the error is probably because levels(factorvar)[1:2] returns just two values, "1" and "2", but t.test expects the vectors on both sides of the ~ to have the same length. In other words, it's not an issue of having different numbers of observations in each factor level. Rather, if, for example, you have 40 values of var1 for factorvar=1 and 50 values of var1 for factorvar=2, then you need a vector of length 90 on both sides of the ~.

Try this instead:

t.test(var1 ~ factorvar, data=mydf[mydf$factorvar %in% c(1,2),])

You can also create a function so that you don't have to type all that code for each combination of factors:

# Function to return p-values from t-test between two factor levels
my.t = function(fac1, fac2){
  t.test(mydf$var1[mydf$factorvar==fac1], 
         mydf$var1[mydf$factorvar==fac2])$p.value
}

# Run the function on factor levels 1 and 2
my.t(1,2)

# Do all four at once
mapply(my.t, seq(1,7,2), seq(2,8,2))

If you want to return the entire output of the t-test for each pair of factor levels (rather than just the p-values), then remove the $p.value from the function above and run mapply with SIMPLIFY=FALSE added.
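As a rough sketch of that full-output variant: the data below is simulated as a stand-in for mydf (assuming columns named var1 and factorvar, as in the question), and my.t.full is a hypothetical name, not something from the original code.

```r
# Simulated stand-in for mydf (assumption: columns var1 and factorvar)
set.seed(1)
mydf <- data.frame(var1 = rnorm(200),
                   factorvar = factor(sample(1:8, 200, replace = TRUE)))

# Same as my.t, but returns the whole t-test result instead of the p-value
my.t.full <- function(fac1, fac2) {
  t.test(mydf$var1[mydf$factorvar == fac1],
         mydf$var1[mydf$factorvar == fac2])
}

# SIMPLIFY = FALSE keeps each result as its own list element
res <- mapply(my.t.full, seq(1, 7, 2), seq(2, 8, 2), SIMPLIFY = FALSE)

# Individual components can still be extracted afterwards
p.vals <- sapply(res, function(x) x$p.value)
```

Without SIMPLIFY=FALSE, mapply would try to collapse the four results into a matrix, which makes the individual tests harder to print and inspect.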

This is a coding site, rather than a statistical advice site, but also beware of multiple comparisons.
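One common way to account for that is to adjust the p-values with p.adjust from the stats package. The sketch below uses simulated data in place of mydf, the name lvl.pairs is made up for illustration, and Holm's method is only one possible choice, not a recommendation for any particular analysis.

```r
# Simulated stand-in for mydf (assumption: columns var1 and factorvar)
set.seed(1)
mydf <- data.frame(var1 = rnorm(200),
                   factorvar = factor(sample(1:8, 200, replace = TRUE)))

# The four pairs of levels to compare
lvl.pairs <- list(c(1, 2), c(3, 4), c(5, 6), c(7, 8))

# Raw p-value from one t-test per pair
p.raw <- sapply(lvl.pairs, function(pr) {
  t.test(var1 ~ factorvar, data = mydf[mydf$factorvar %in% pr, ])$p.value
})

# Holm adjustment for the four comparisons
p.adj <- p.adjust(p.raw, method = "holm")
```

The adjusted values in p.adj are never smaller than the raw ones, so a comparison that looked borderline on its own may no longer pass a 0.05 cutoff.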

Upvotes: 2
