Apply Chi-Squared Test in R on more than 5 variables and find the p-values

Question

I am new to the chi-squared test. I have a data set with many categorical variables.

A sample of the data, showing a few of the variables:

I want to apply the chi-squared test in R and find the p-value for each of these categorical variables. Based on the p-values, I will rank the variables and drop the least important ones.

Can you advise me how to find the p-values for all of the above variables in R?

As far as I know, the chi-squared test applies to only two categorical variables at a time, but I have many. How can I do this?

Edward · Accepted Answer

You can use lapply for repeated tasks. Here it runs a chi-squared test of each remaining column of a data frame against the first column.

CHIS <- lapply(data[,-1], function(x) chisq.test(data[,1], x)); CHIS

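The result is a list, which can be combined into a more readable format using do.call and rbind. The first three components of each test result are the statistic, the degrees of freedom (parameter) and the p-value.

do.call(rbind, CHIS)[, 1:3]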
   statistic    parameter p.value  
X1 0.08680556   1         0.7682782
X2 0.9695384    1         0.3247953
X3 9.464545e-31 1         1        
X4 0.9695384    1         0.3247953
X5 0.78125      1         0.3767591
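
If only the p-values are needed, for example to rank the variables as the question asks, they can be pulled out of each test result directly. The following is a minimal sketch, assuming a data frame shaped like the reproducible example (target in the first column, binary predictors after it). The names toy, tests and pvals are illustrative and do not come from the answer.

```r
# Toy data in the same shape as the reproducible example (values are arbitrary)
set.seed(1)
toy <- data.frame(
  AppCategory = rep(c("Benign", "Malware"), times = c(20, 30)),
  matrix(sample(c(0, 1), size = 50 * 5, replace = TRUE), ncol = 5)
)

# One chi-squared test per predictor column against the first column
tests <- lapply(toy[, -1], function(x) chisq.test(toy[, 1], x))

# Named numeric vector of p-values, one entry per predictor
pvals <- sapply(tests, function(res) res$p.value)

# Smallest p-value first, i.e. strongest evidence of association first
sort(pvals)
```

Because sapply keeps the list names, each p-value stays labelled with its column name.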

Alternatively, use the tidy function from the broom package.

library(broom)

do.call(rbind, lapply(CHIS, tidy))

# A tibble: 5 x 4
  statistic p.value parameter method                                                      
*     <dbl>   <dbl>     <int> <chr>
1  8.68e- 2   0.768         1 Pearson's Chi-squared test with Yates' continuity correction
2  9.70e- 1   0.325         1 Pearson's Chi-squared test with Yates' continuity correction
3  9.46e-31   1.00          1 Pearson's Chi-squared test with Yates' continuity correction
4  9.70e- 1   0.325         1 Pearson's Chi-squared test with Yates' continuity correction
5  7.81e- 1   0.377         1 Pearson's Chi-squared test with Yates' continuity correction

Unfortunately, the variable names are lost this way. The rbindlist function from data.table has an optional idcol argument that preserves the names of the original list.

library(data.table)
rbindlist(lapply(CHIS, tidy), idcol=TRUE)

   .id    statistic   p.value parameter
1:  X1 8.680556e-02 0.7682782         1
2:  X2 9.695384e-01 0.3247953         1
3:  X3 9.464545e-31 1.0000000         1
4:  X4 9.695384e-01 0.3247953         1
5:  X5 7.812500e-01 0.3767591         1
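
To get from such a table to a ranking, the rows can be ordered by p-value. The sketch below uses base R only and makes the same assumptions about the data layout as the reproducible example. The names toy, tests and ranked are illustrative. The Holm adjustment via p.adjust is an optional addition that is not part of the answer, included because several tests are run at once.

```r
# Toy data in the same shape as the reproducible example (values are arbitrary)
set.seed(1)
toy <- data.frame(
  AppCategory = rep(c("Benign", "Malware"), times = c(20, 30)),
  matrix(sample(c(0, 1), size = 50 * 5, replace = TRUE), ncol = 5)
)
tests <- lapply(toy[, -1], function(x) chisq.test(toy[, 1], x))

# One row per predictor, keeping the variable name as a column
ranked <- data.frame(
  variable  = names(tests),
  statistic = sapply(tests, function(res) unname(res$statistic)),
  p.value   = sapply(tests, function(res) res$p.value),
  row.names = NULL
)

# Optional: adjust for multiple testing (Holm's method)
ranked$p.adjusted <- p.adjust(ranked$p.value, method = "holm")

# Order by raw p-value, smallest first
ranked <- ranked[order(ranked$p.value), ]
ranked
```

Variables at the bottom of ranked show the weakest evidence of association with the first column and are the natural candidates for removal.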

Reproducible example:

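nvars <- 5; nrows <- 50
set.seed(123)
# Five random 0/1 predictor columns, named X1..X5
X <- data.frame(matrix(sample(c(0, 1), size = nrows * nvars, replace = TRUE), ncol = nvars))
# stringsAsFactors = TRUE keeps AppCategory a factor on R >= 4.0, matching the str() output
data <- data.frame(AppCategory = c(rep("Benign", 20), rep("Malware", 30)), X, stringsAsFactors = TRUE)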
str(data)

'data.frame':   50 obs. of  6 variables:
 $ AppCategory: Factor w/ 2 levels "Benign","Malware": 1 1 1 1 1 1 1 1 1 1 ...
 $ X1         : num  0 0 0 1 0 1 1 1 0 0 ...
 $ X2         : num  1 0 0 0 0 1 1 0 1 0 ...
 $ X3         : num  0 1 1 0 1 1 0 0 0 1 ...
 $ X4         : num  0 1 0 1 0 0 0 0 0 0 ...
 $ X5         : num  1 1 1 0 1 1 1 0 1 1 ...
