Affan

Reputation: 151

Apply Chi-Squared Test in R on more than 5 variables and find the p-values

I am new to the chi-squared test. I have a database with many categorical variables.

A sample of the database, showing a few of the variables:

[image: sample of the database]

I want to apply the chi-squared test in R and find the p-value for each of these categorical variables. Based on the p-values, I will rank my variables and delete the least important ones.

Can you advise me on how to find the p-values of all the above variables in R?

As far as I know, the chi-squared test can only be applied to two categorical variables at a time, but I have many categorical variables. How can I do this?

Upvotes: 5

Views: 13038

Answers (2)

Edward

Reputation: 19494

You can use lapply to repeat a task. Here it runs a chi-squared test of each remaining column of a data frame against the first column.

CHIS <- lapply(data[,-1], function(x) chisq.test(data[,1], x)); CHIS

The result is a list, which can be combined in a nicer viewable format using do.call and rbind.

do.call(rbind, CHIS)[, 1:3]
   statistic    parameter p.value  
X1 0.08680556   1         0.7682782
X2 0.9695384    1         0.3247953
X3 9.464545e-31 1         1        
X4 0.9695384    1         0.3247953
X5 0.78125      1         0.3767591

Alternatively, use the tidy function from broom.

library(broom)

do.call(rbind, lapply(CHIS, tidy))

# A tibble: 5 x 4
  statistic p.value parameter method                                                      
*     <dbl>   <dbl>     <int> <chr>                                                       
1  8.68e- 2   0.768         1 Pearson's Chi-squared test with Yates' continuity correction
2  9.70e- 1   0.325         1 Pearson's Chi-squared test with Yates' continuity correction
3  9.46e-31   1.00          1 Pearson's Chi-squared test with Yates' continuity correction
4  9.70e- 1   0.325         1 Pearson's Chi-squared test with Yates' continuity correction
5  7.81e- 1   0.377         1 Pearson's Chi-squared test with Yates' continuity correction

Unfortunately, the variable names are lost. The rbindlist function from data.table has an optional idcol argument that preserves the names of the original list.

library(data.table)
rbindlist(lapply(CHIS, tidy), idcol=TRUE)

   .id    statistic   p.value parameter
1:  X1 8.680556e-02 0.7682782         1
2:  X2 9.695384e-01 0.3247953         1
3:  X3 9.464545e-31 1.0000000         1
4:  X4 9.695384e-01 0.3247953         1
5:  X5 7.812500e-01 0.3767591         1
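Since the question asks for a ranking, here is a minimal sketch of one way to get from the list of tests to an ordered table of p-values. It rebuilds toy data like the reproducible example so that it runs on its own. The `ranking` data frame and its `p.holm` column are names made up for this illustration, and the Holm adjustment via `p.adjust` is an optional extra that the answer above does not use.

```r
# Toy data in the same shape as the reproducible example (assumption: binary predictors)
set.seed(123)
nvars <- 5
nrows <- 50
X <- data.frame(matrix(sample(c(0, 1), size = nrows * nvars, replace = TRUE),
                       ncol = nvars))
data <- data.frame(AppCategory = c(rep("Benign", 20), rep("Malware", 30)), X)

# One chi-squared test per predictor column against the first column
CHIS <- lapply(data[, -1], function(x) chisq.test(data[, 1], x))

# Extract the p-values as a named numeric vector
pvals <- vapply(CHIS, function(res) res$p.value, numeric(1))

# Hypothetical ranking table: smallest p-value first, plus Holm-adjusted p-values
ranking <- data.frame(variable = names(pvals),
                      p.value  = unname(pvals),
                      p.holm   = unname(p.adjust(pvals, method = "holm")),
                      row.names = NULL)
ranking <- ranking[order(ranking$p.value), ]
ranking
```

With many variables, some raw p-values will be small by chance alone, which is why an adjusted column can be useful next to the raw one. Whether a p-value ranking is a good basis for dropping variables is a separate modelling question.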

Reproducible example:

nvars <- 5; nrows <- 50
set.seed(123)
X <- data.frame(matrix(sample(c(0,1), size=nrows*nvars, replace=TRUE), ncol=nvars))
data <- data.frame(AppCategory=c(rep("Benign", 20), rep("Malware", 30)), X)
str(data)

'data.frame':   50 obs. of  6 variables:
 $ AppCategory: Factor w/ 2 levels "Benign","Malware": 1 1 1 1 1 1 1 1 1 1 ...
 $ X1         : num  0 0 0 1 0 1 1 1 0 0 ...
 $ X2         : num  1 0 0 0 0 1 1 0 1 0 ...
 $ X3         : num  0 1 1 0 1 1 0 0 0 1 ...
 $ X4         : num  0 1 0 1 0 0 0 0 0 0 ...
 $ X5         : num  1 1 1 0 1 1 1 0 1 1 ...
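To see what each of these tests is actually working on, it can help to look at a single contingency table. The sketch below does that for one simulated binary column and also checks the expected cell counts. The fallback to `fisher.test` when any expected count is below 5 is a common rule of thumb that I am assuming here, not something taken from the answer, and the variable names `tab`, `small` and `p` are only illustrative.

```r
# One simulated binary predictor and the two-level outcome (assumed toy data)
set.seed(123)
X1 <- sample(c(0, 1), size = 50, replace = TRUE)
AppCategory <- c(rep("Benign", 20), rep("Malware", 30))

# The 2 x 2 contingency table that chisq.test() builds internally from two vectors
tab <- table(AppCategory, X1)
tab

# Run the test on the table and inspect the expected counts under independence
res <- chisq.test(tab)
res$expected

# Rule-of-thumb check (assumption): use Fisher's exact test if any expected count < 5
small <- any(res$expected < 5)
p <- if (small) fisher.test(tab)$p.value else res$p.value
p
```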

Upvotes: 4

aiatay7n

Reputation: 182

First, review the details in this related question: "performing a chi square test across multiple variables and extracting the relevant p value in R". Then see the similar solution code below:

> # Assuming your data frame looks something like this:
> x1 <- sample(1:7, 5, replace = FALSE)
> x2 <- sample(2:7, 5, replace = TRUE)
> x3 <- sample(1:6, 5, replace = TRUE)
> x4 <- sample(3:8, 5, replace = TRUE)
> y <- sample(1:100, 5, replace = FALSE)
> df <- data.frame(x1, x2, x3, x4, y)
> # Test each of the first four columns against the fifth and keep only the p-value
> mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs = list(y = df[, 5]))
       x1        x2        x3        x4 
0.2202206 0.2202206 0.2872975 0.2414365 
# Note: this is only a sketch; adapt it to your data and check that the test's assumptions hold.
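The same mapply pattern can be tried on data that is actually categorical. The following is a rough sketch on seeded, made-up data; the column names and category labels are invented for the example. It also checks that passing two vectors to `chisq.test` gives the same p-value as passing the cross-tabulation from `table`, which is what the test is computed on in both cases.

```r
# Made-up categorical data (all names and levels here are hypothetical)
set.seed(42)
n <- 200
df <- data.frame(
  x1 = sample(c("a", "b", "c"), n, replace = TRUE),
  x2 = sample(c("low", "high"), n, replace = TRUE),
  x3 = sample(c("yes", "no"), n, replace = TRUE),
  y  = sample(c("Benign", "Malware"), n, replace = TRUE)
)
predictors <- setdiff(names(df), "y")

# mapply approach: each predictor column is tested against the fixed outcome y
p_mapply <- mapply(function(x, y) chisq.test(x, y)$p.value,
                   df[predictors], MoreArgs = list(y = df$y))

# Equivalent approach: build each contingency table explicitly first
p_table <- vapply(predictors,
                  function(v) chisq.test(table(df[[v]], df$y))$p.value,
                  numeric(1))

# Both routes should agree
all.equal(p_mapply, p_table)
```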

Upvotes: 0
