Henry George

Reputation: 53

Running multiple chi-squared tests for different categories

I have binary data recording whether each individual passed or failed a test, along with characteristic information (e.g. gender) and the department they belonged to (e.g. x, y, z), in a data frame called data:

head(data,9)
department  gender   pass 
x           Male     1               
y           Female   1             
y           Male     0         
y           Male     1              
x           Female   1              
z           Female   0            
z           Male     1
x           Male     0
z           Female   0

I can easily run a chi-squared test on the relationship between gender and passing with:

chisq.test(data$gender, data$pass)

But is there a way to run this separately for each value of 'department' (x, y, z) without having to manually subset the data each time?

I can create a new dataframe that breaks down the overall pass rate for each department using tapply:

as.data.frame(tapply(data$pass, data$department,mean))

But is there a way I can add a new variable holding the result of the test outlined above (say, the p-value) for each department?

Upvotes: 5

Views: 3562

Answers (3)

Chuck P

Reputation: 3923

This is not so much a different answer to your question as an answer to a different question. @JasonAizkalns has given you an elegant way to run the test within each department, but if you are interested in comparing departments with each other, you need to adjust for multiple comparisons. The code below builds a department-by-gender table of counts and compares each pair of departments, with a Holm adjustment. It might look something like this:

library(dplyr)
library(rcompanion)

df <- data.frame(
  stringsAsFactors = FALSE,
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male","Female","Male",
             "Male","Female","Female","Male","Male","Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
)

df %>%
  group_by(department, gender) %>%
  summarise(Freq = n()) %>%
  xtabs(formula = Freq ~ ., data = .) %>% 
  pairwiseNominalIndependence(x = ., method = "holm", gtest = FALSE)

#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect

#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect

#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#>   Comparison p.Fisher p.adj.Fisher p.Chisq p.adj.Chisq
#> 1      x : y        1            1       1           1
#> 2      x : z        1            1       1           1
#> 3      y : z        1            1       1           1
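
If rcompanion is not available, the same idea can be sketched in base R: run one test per pair of departments, then adjust the p-values with p.adjust. This is only an illustration of the technique, not the rcompanion implementation; the names dept_pairs, raw_p and pairwise are made up here, and Fisher's exact test is used because the counts are tiny.

```r
# Same toy data as above.
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# All unordered pairs of departments: (x, y), (x, z), (y, z).
dept_pairs <- combn(sort(unique(df$department)), 2, simplify = FALSE)

# One Fisher exact test per pair on the department-by-gender table.
raw_p <- sapply(dept_pairs, function(pr) {
  sub <- df[df$department %in% pr, ]
  fisher.test(table(sub$department, sub$gender))$p.value
})

# Holm adjustment across the whole family of pairwise tests.
pairwise <- data.frame(
  comparison = sapply(dept_pairs, paste, collapse = " : "),
  p_fisher = raw_p,
  p_adj_fisher = p.adjust(raw_p, method = "holm"),
  stringsAsFactors = FALSE
)
pairwise
```

Swapping method = "holm" for another method accepted by p.adjust (for example "bonferroni" or "BH") changes only the adjustment step.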

Upvotes: 2

jay.sf

Reputation: 72593

Yes, there is! Use by() to apply a function to each department's subset and bind the results together.

# For each department: mean pass rate (M) and chi-squared p-value (p)
res <- do.call(rbind, by(dat, dat$department, function(x) {
  c(M=mean(x$pass),  # x holds a single department, so a plain mean suffices
    p=chisq.test(x$gender, x$pass)$p.value)
}))
res
#           M            p
# x 0.6788732 1.484695e-18
# y 0.6516517 3.045009e-22
# z 0.3205128 7.945768e-69

Data:

dat <- read.table(text="department  gender   pass 
x           Male     1               
y           Female   1             
y           Male     0         
y           Male     1              
x           Female   1              
z           Female   0            
z           Male     1
x           Male     0
z           Female   0", header=TRUE)
set.seed(42)
# Resample the 9 rows with replacement to get 1000 rows
dat <- dat[sample(1:nrow(dat), 1000, replace=TRUE), ]

Upvotes: 0

JasonAizkalns

Reputation: 20463

Using broom with dplyr is an elegant approach to this. First we group by the department variable and nest the data frame. We then run chisq.test on each subset. Finally, to pull out the relevant statistics (e.g. p.value), we use broom::tidy. Since the results are nested within each subset, we unnest whichever components we ultimately want to see.

See this vignette for more details

library(tidyverse)
library(broom)

df <- data.frame(
  stringsAsFactors = FALSE,
        department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
            gender = c("Male","Female","Male",
                       "Male","Female","Female","Male","Male","Female"),
              pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
)


df %>%
  group_by(department) %>%
  nest() %>% 
  mutate(
    chi_test = map(data, ~ chisq.test(.$gender, .$pass)),
    tidied = map(chi_test, tidy)
  ) %>% 
  unnest(tidied)

#> # A tibble: 3 x 7
#> # Groups:   department [3]
#>   department data      chi_test statistic p.value parameter method              
#>   <chr>      <list>    <list>       <dbl>   <dbl>     <int> <chr>               
#> 1 x          <tibble ~ <htest>   4.62e-32   1.00          1 Pearson's Chi-squar~
#> 2 y          <tibble ~ <htest>   4.62e-32   1.00          1 Pearson's Chi-squar~
#> 3 z          <tibble ~ <htest>   1.88e- 1   0.665         1 Pearson's Chi-squar~

Created on 2020-05-20 by the reprex package (v0.3.0)

If you want to use base R, you could leverage split and lapply with something like this:

lapply(split(df, df$department), function(x) { chisq.test(x$gender, x$pass)$p.value })
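
To get the pass rate and the p-value side by side in one data frame, as the question asks, the split results can be collected with sapply. A minimal sketch, assuming the df defined above; p_values, pass_rate and dept_summary are names introduced here for illustration. suppressWarnings only hides the "Chi-squared approximation may be incorrect" warning caused by the tiny counts. Note that chisq.test will stop with an error for any department where gender or pass takes only one value.

```r
# Same toy data as above.
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# One chi-squared test per department; keep only the p-value.
p_values <- sapply(split(df, df$department), function(x) {
  suppressWarnings(chisq.test(x$gender, x$pass)$p.value)
})

# Pass rate per department, named by department.
pass_rate <- tapply(df$pass, df$department, mean)

# Combine both into one row per department, matched by name.
dept_summary <- data.frame(
  department = names(p_values),
  pass_rate = as.vector(pass_rate[names(p_values)]),
  p_value = as.vector(p_values),
  stringsAsFactors = FALSE
)
dept_summary
```

Matching on names(p_values) keeps the two columns aligned even if the two summaries were to come back in a different department order.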

Upvotes: 4
