R: Testing each level of a factor without creating new variables

Question

Suppose I have a data frame with a binary grouping variable and a factor. Such a grouping variable could, for example, specify assignment to the treatment and control conditions of an experiment. In the example below, b is the grouping variable and a is an arbitrary factor variable:

a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)

I want to complete two-sample t-tests to assess the below:

For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.

I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:

library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)

I am looking for help with the following:

Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?

Note

This is similar to, but different from, "R - How to perform the same operation on multiple variables". It is nearly the same as the question "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.

nothing · Accepted Answer

Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):

set.seed(1)

## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = TRUE)
## note: b has length 10 and is recycled by data.frame() to 100 rows
b <- sample(0:1, 10, replace = TRUE)
head((df <- data.frame(a,b)))

  a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0

## create a set of contingency tables for proportions 
## of each level of df$a to the others
cTbls  <- lapply(unique(a), function(x) table(df$a==x, df$b))

## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)

## only one result displayed for sake of space:
results$b

    2-sample test for equality of proportions without continuity
    correction

data:  X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2557295  0.1638177
sample estimates:
   prop 1    prop 2 
0.4852941 0.5312500
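
If two-sample t-tests are specifically wanted, the same lapply() pattern can be applied without storing any dummy columns. The following is only a sketch under assumptions: the toy data are regenerated here so the block runs on its own, and the names lvls, tTests and pVals are illustrative rather than taken from the code above. For a 0/1 indicator, the group means compared by t.test() are the same proportions that prop.test() compares.

```r
set.seed(1)

## toy data of the same shape as above (values are illustrative)
a <- sample(letters[1:4], 100, replace = TRUE)
b <- sample(0:1, 100, replace = TRUE)
df <- data.frame(a, b)

## levels of the factor, in a fixed order
lvls <- sort(unique(df$a))

## for each level, build the 0/1 indicator on the fly and
## compare its mean between the two groups in b
tTests <- lapply(lvls, function(x) {
  tmp <- data.frame(ind = as.numeric(df$a == x), b = df$b)
  t.test(ind ~ b, data = tmp)
})
names(tTests) <- lvls

## collect the p-values into one named vector
pVals <- sapply(tTests, function(x) x$p.value)
pVals
```

Each element of tTests is an ordinary htest object, so tTests$b prints the full result for level "b" in the same way as results$b above.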

Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5% (!):

set.seed(11)

## estimate how often at least one test rejects when the null is true
sum(
  replicate(1e4, {
    a <- sample(letters[1:4], 100, replace = TRUE)
    b <- sample(0:1, 100, replace = TRUE)
    df <- data.frame(a,b)
    cTbls  <- lapply(unique(a), function(x) table(df$a==x, df$b))
    results <- lapply(cTbls, prop.test, correct = FALSE)
    ## sapply() returns a logical vector; any() does not accept a list
    any(sapply(results, function(x) x$p.value < .05))
  })
) / 1e4
[1] 0.1642
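
One way to apply such a correction is base R's p.adjust(). The sketch below uses made-up p-values purely for illustration (they are not results from the data above); with the code in this answer, the raw p-values could instead be collected with sapply(results, function(x) x$p.value).

```r
## hypothetical raw p-values, one per factor level (illustrative only)
pVals <- c(a = 0.012, b = 0.668, c = 0.049, d = 0.300)

## Holm's step-down adjustment controls the family-wise error rate
pHolm <- p.adjust(pVals, method = "holm")

## Bonferroni is simpler but never less conservative than Holm
pBonf <- p.adjust(pVals, method = "bonferroni")

## compare raw and adjusted values side by side
cbind(raw = pVals, holm = pHolm, bonferroni = pBonf)

## which levels remain significant at the 5% level after adjustment
pHolm < 0.05
```

Holm's method is a reasonable default here because it is never less powerful than Bonferroni while giving the same family-wise guarantee; other methods accepted by p.adjust() (for example "BH" for false discovery rate control) may suit different goals.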
