Reputation: 53
I have binary data recording whether each individual passed or failed a test, along with characteristic information (e.g. gender) and the department they belonged to (e.g. x, y, z), in a data frame called data:
head(data,9)
department gender pass
x Male 1
y Female 1
y Male 0
y Male 1
x Female 1
z Female 0
z Male 1
x Male 0
z Female 0
I can easily run a chi-square test on the relationship between gender and passing with:
chisq.test(data$gender, data$pass)
But is there a way that this can be run separately for values in 'department' (x,y,z) without having to manually subset the data each time?
I can create a new dataframe that breaks down the overall pass rate for each department using tapply:
as.data.frame(tapply(data$pass, data$department,mean))
But is there a way I can add a new variable that holds the result of the test outlined above (say, the p-value)?
Upvotes: 5
Views: 3562
Reputation: 3923
This is not exactly a different answer to your question, but an answer in case you are really asking a different question. @JasonAizkalns has given you an elegant answer for testing within each department, but if you are interested in comparing departments with each other you need to adjust for multiple comparisons. It might look something like this:
library(dplyr)
library(rcompanion)
df <- data.frame(
stringsAsFactors = FALSE,
department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
gender = c("Male","Female","Male",
"Male","Female","Female","Male","Male","Female"),
pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
)
df %>%
group_by(department, gender) %>%
summarise(Freq = n()) %>%
xtabs(formula = Freq ~ ., data = .) %>%
pairwiseNominalIndependence(x = ., method = "holm", gtest = FALSE)
#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#> Warning in chisq.test(Dataz, ...): Chi-squared approximation may be incorrect
#> Comparison p.Fisher p.adj.Fisher p.Chisq p.adj.Chisq
#> 1 x : y 1 1 1 1
#> 2 x : z 1 1 1 1
#> 3 y : z 1 1 1 1
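As a rough base R sketch of the same idea (pairwise comparisons followed by a Holm adjustment), the block below compares pass rates between each pair of departments. It assumes the quantity of interest is pass versus department, and it swaps in fisher.test because the cell counts in this toy data are small; the names pairs, p_raw and res are only illustrative.

```r
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# All unordered pairs of departments: (x, y), (x, z), (y, z)
pairs <- combn(sort(unique(df$department)), 2, simplify = FALSE)

# Raw p-value for each pair: Fisher's exact test of department vs. pass
# (an assumption made here because of the small counts)
p_raw <- vapply(pairs, function(pr) {
  sub <- df[df$department %in% pr, ]
  fisher.test(table(sub$department, sub$pass))$p.value
}, numeric(1))

# Holm adjustment across the three comparisons
res <- data.frame(
  comparison = vapply(pairs, paste, character(1), collapse = " : "),
  p_raw = p_raw,
  p_holm = p.adjust(p_raw, method = "holm")
)
res
```

Holm-adjusted p-values are never smaller than the raw ones, so a comparison that is not significant before adjustment stays that way afterwards.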
Upvotes: 2
Reputation: 72593
Yes, there is! Use by():
res <- do.call(rbind, by(dat, dat$department, function(x) {
  # x holds the rows for a single department
  c(M=mean(x$pass),                          # pass rate in this department
    p=chisq.test(x$gender, x$pass)$p.value)  # gender vs. pass p-value
}))
res
# M p
# x 0.6788732 1.484695e-18
# y 0.6516517 3.045009e-22
# z 0.3205128 7.945768e-69
Data:
dat <- read.table(text="department gender pass
x Male 1
y Female 1
y Male 0
y Male 1
x Female 1
z Female 0
z Male 1
x Male 0
z Female 0", header=TRUE)
set.seed(42)
# Resample the nine rows with replacement to get a 1000-row example data set
dat <- dat[sample(1:nrow(dat), 1000, replace=TRUE), ]
Upvotes: 0
Reputation: 20463
Using broom with dplyr is an elegant approach to this. First we group by the department variable and nest our data frame. We then run chisq.test against each subset. Finally, to pull out the relevant statistics (e.g. p.value) we use broom::tidy. Since these results are nested within each subset, we unnest whichever components we ultimately want to see.
See this vignette for more details
library(tidyverse)
library(broom)
df <- data.frame(
stringsAsFactors = FALSE,
department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
gender = c("Male","Female","Male",
"Male","Female","Female","Male","Male","Female"),
pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L)
)
df %>%
group_by(department) %>%
nest() %>%
mutate(
chi_test = map(data, ~ chisq.test(.$gender, .$pass)),
tidied = map(chi_test, tidy)
) %>%
unnest(tidied)
#> # A tibble: 3 x 7
#> # Groups: department [3]
#> department data chi_test statistic p.value parameter method
#> <chr> <list> <list> <dbl> <dbl> <int> <chr>
#> 1 x <tibble ~ <htest> 4.62e-32 1.00 1 Pearson's Chi-squar~
#> 2 y <tibble ~ <htest> 4.62e-32 1.00 1 Pearson's Chi-squar~
#> 3 z <tibble ~ <htest> 1.88e- 1 0.665 1 Pearson's Chi-squar~
Created on 2020-05-20 by the reprex package (v0.3.0)
If you want to use base R, you could leverage split and lapply with something like this:
lapply(split(df, df$department), function(x) { chisq.test(x$gender, x$pass)$p.value })
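To get the pass rate and the p-value side by side in one data frame, as the question asks, the split and lapply idea can be extended along these lines. This is only a sketch: per_dept and res are illustrative names, and with samples this small chisq.test will warn that the chi-squared approximation may be incorrect.

```r
df <- data.frame(
  department = c("x", "y", "y", "y", "x", "z", "z", "x", "z"),
  gender = c("Male", "Female", "Male", "Male", "Female",
             "Female", "Male", "Male", "Female"),
  pass = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# One single-row data frame per department
per_dept <- lapply(split(df, df$department), function(x) {
  data.frame(
    department = x$department[1],
    pass_rate = mean(x$pass),                         # share of passes
    p_value = chisq.test(x$gender, x$pass)$p.value,   # gender vs. pass
    stringsAsFactors = FALSE
  )
})

# Stack the per-department rows into one summary table
res <- do.call(rbind, per_dept)
rownames(res) <- NULL
res
```

Because each department contributes exactly one row, further per-department summaries can be added as extra columns inside the function.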
Upvotes: 4