Reputation: 10483
I have an R data frame with an ordered factor variable with 8 levels. I want to do a t-test between levels 1 & 2, 3 & 4, 5 & 6 and 7 & 8. While I can subset the data to extract each pair of categories, I am wondering if there is an easier way to do it. I can't figure it out. I tried the following, but it complains about differing lengths (each level has a different number of observations):
t.test(var1 ~ levels(factorvar)[1:2], data = mydf)
Upvotes: 2
Views: 8433
Reputation: 263451
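# Each element holds the two factor levels to compare
pairs <- list( c(1, 2), c(3, 4), c(5, 6), c(7, 8) )
# Run one t-test per pair, on only the rows belonging to that pair
lapply(pairs, function(pr) {
    t.test( var1 ~ factorvar,
            data=dat[dat$factorvar %in% pr, c("var1", "factorvar")] )
})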
I don't think the extra (unrepresented) levels should cause problems with t.test.formula,
since it re-factors the grouping variable and so drops the unused levels. Could also try:
lapply(pairs, function(pr) {
    t.test( var1 ~ factorvar,
            data=dat[ , c("var1", "factorvar")],
            subset= factorvar %in% pr)
})
Note: Tested with:
dat <- data.frame(var1=rnorm(100),
                  factorvar=factor(sample(1:8, 100, replace=TRUE)))
Sample output:
[[1]]
Welch Two Sample t-test
data: var1 by factorvar
t = -1.2077, df = 8.419, p-value = 0.26
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.3597432 0.4197142
sample estimates:
mean in group 1 mean in group 2
-0.1819342 0.2880802
[[2]]
Welch Two Sample t-test
data: var1 by factorvar
t = -0.8141, df = 20.676, p-value = 0.4249
#--------- rest of output snipped-------
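To compare the four tests at a glance, the list returned by lapply can be condensed into a small table. This is only a sketch: the names res and summary_tab are introduced here for illustration, and the test data is rebuilt with a seed (and with explicit levels) so the block runs on its own.

```r
# Rebuild the test data with a seed (the seed and levels= are assumptions)
set.seed(42)
dat <- data.frame(var1 = rnorm(100),
                  factorvar = factor(sample(1:8, 100, replace = TRUE),
                                     levels = 1:8))
pairs <- list(c(1, 2), c(3, 4), c(5, 6), c(7, 8))

# Same lapply call as above, with the result stored in a list
res <- lapply(pairs, function(pr) {
  t.test(var1 ~ factorvar,
         data = dat[dat$factorvar %in% pr, c("var1", "factorvar")])
})

# One row per pair: t statistic, Welch degrees of freedom, p-value
summary_tab <- data.frame(
  pair    = sapply(pairs, paste, collapse = " vs "),
  t       = sapply(res, function(x) unname(x$statistic)),
  df      = sapply(res, function(x) unname(x$parameter)),
  p.value = sapply(res, function(x) x$p.value)
)
print(summary_tab)
```

Each element of res is still a full htest object, so res[[1]] prints the usual t-test output for levels 1 and 2.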
Upvotes: 2
Reputation: 93861
I think the error is probably because levels(factorvar)[1:2] returns just two values, "1" and "2", but t.test expects the length of the vectors on both sides of the ~ to be the same. In other words, it's not an issue of having different numbers of observations in each factor level. Rather, if, for example, you have 40 values of var1 for factorvar=1 and 50 values of var1 for factorvar=2, then you need a vector of length 90 on both sides of the ~.
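A minimal sketch of that length mismatch, using made-up data (the data frame below is an assumption for illustration, not the asker's actual data):

```r
# Hypothetical data: 100 observations, factor with 8 levels
set.seed(1)
mydf <- data.frame(var1 = rnorm(100),
                   factorvar = factor(sample(1:8, 100, replace = TRUE),
                                      levels = 1:8))

length(mydf$var1)                     # 100: one value per row
length(levels(mydf$factorvar)[1:2])   # 2: just the labels "1" and "2"

# Subsetting rows keeps both sides of the formula the same length
sub <- mydf[mydf$factorvar %in% c(1, 2), ]
length(sub$var1) == length(sub$factorvar)   # TRUE
```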
Try this instead:
t.test(var1 ~ factorvar, data=mydf[mydf$factorvar %in% c(1,2),])
You can also create a function so that you don't have to type all that code for each combination of factors:
# Function to return p-values from t-test between two factor levels
my.t = function(fac1, fac2){
    t.test(mydf$var1[mydf$factorvar==fac1],
           mydf$var1[mydf$factorvar==fac2])$p.value
}
# Run the function on factor levels 1 and 2
my.t(1,2)
# Do all four at once
mapply(my.t, seq(1,7,2), seq(2,8,2))
If you want to return the entire output of the t-test for each pair of factor levels (rather than just the p-values), then remove the $p.value from the function above and run mapply with SIMPLIFY=FALSE added.
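A sketch of that variant, assuming made-up data in mydf; the names my.t.full and full are introduced here only for illustration:

```r
# Hypothetical data standing in for the asker's data frame
set.seed(1)
mydf <- data.frame(var1 = rnorm(100),
                   factorvar = factor(sample(1:8, 100, replace = TRUE),
                                      levels = 1:8))

# Same idea as my.t, but returning the whole htest object
my.t.full <- function(fac1, fac2) {
  t.test(mydf$var1[mydf$factorvar == fac1],
         mydf$var1[mydf$factorvar == fac2])
}

# SIMPLIFY = FALSE keeps a list with one htest object per pair
full <- mapply(my.t.full, seq(1, 7, 2), seq(2, 8, 2), SIMPLIFY = FALSE)

full[[1]]            # full printout for levels 1 vs 2
full[[1]]$conf.int   # individual components are still accessible
```

Without SIMPLIFY = FALSE, mapply would try to collapse the htest objects into a matrix, which is much harder to read.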
This is a coding site, rather than a statistical advice site, but also beware of multiple comparisons.
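On the coding side, one common way to account for multiple comparisons is p.adjust from the stats package. The sketch below assumes made-up data and picks the Holm method purely as an example; which correction (if any) is appropriate is a statistical question for your own analysis.

```r
# Hypothetical data standing in for the asker's data frame
set.seed(1)
mydf <- data.frame(var1 = rnorm(100),
                   factorvar = factor(sample(1:8, 100, replace = TRUE),
                                      levels = 1:8))

# p-value from a t-test between two factor levels (as in my.t above)
my.t <- function(fac1, fac2) {
  t.test(mydf$var1[mydf$factorvar == fac1],
         mydf$var1[mydf$factorvar == fac2])$p.value
}

# Raw p-values for the four pairs, then Holm-adjusted p-values
p.raw  <- mapply(my.t, seq(1, 7, 2), seq(2, 8, 2))
p.holm <- p.adjust(p.raw, method = "holm")

cbind(p.raw, p.holm)
```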
Upvotes: 2