Reputation: 151
I am new to the chi-squared test. I have a database with lots of categorical variables.
A sample of the database with a few of the variables:
I want to apply the chi-squared test in R and find the p-values for all of these categorical variables. Based on those, I will rank my variables and delete the least important ones.
Can you advise me how to find the p-values for all of the above variables in R?
As far as I know, the chi-squared test can only be applied to two categorical variables at a time, but I have many categorical variables. How can I do this?
Upvotes: 5
Views: 13038
Reputation: 19494
You can use lapply to do repeated tasks; here, a chi-squared test of each remaining column of a data frame against the first column.
CHIS <- lapply(data[,-1], function(x) chisq.test(data[,1], x)); CHIS
The result is a list, which can be combined into a more readable format using do.call and rbind.
do.call(rbind, CHIS)[, 1:3]
   statistic    parameter p.value
X1 0.08680556   1         0.7682782
X2 0.9695384    1         0.3247953
X3 9.464545e-31 1         1
X4 0.9695384    1         0.3247953
X5 0.78125      1         0.3767591
Or perhaps using the tidy function from broom.
library(broom)
do.call(rbind, lapply(CHIS, tidy))
# A tibble: 5 x 4
statistic p.value parameter method
* <dbl> <dbl> <int> <chr>
1 8.68e- 2 0.768 1 Pearson's Chi-squared test with Yates' continuity correction
2 9.70e- 1 0.325 1 Pearson's Chi-squared test with Yates' continuity correction
3 9.46e-31 1.00 1 Pearson's Chi-squared test with Yates' continuity correction
4 9.70e- 1 0.325 1 Pearson's Chi-squared test with Yates' continuity correction
5 7.81e- 1 0.377 1 Pearson's Chi-squared test with Yates' continuity correction
But unfortunately the names disappear. The rbindlist function from data.table has an optional idcol argument to preserve the names from the original list.
library(data.table)
rbindlist(lapply(CHIS, tidy), idcol=TRUE)
.id statistic p.value parameter
1: X1 8.680556e-02 0.7682782 1
2: X2 9.695384e-01 0.3247953 1
3: X3 9.464545e-31 1.0000000 1
4: X4 9.695384e-01 0.3247953 1
5: X5 7.812500e-01 0.3767591 1
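If only the p-values are needed, for instance to rank the variables as the question asks, a minimal base-R sketch along the following lines may be enough. It assumes a data frame whose first column is the class label and whose remaining columns are categorical; the names toy, label, pvals, and ranked are made up for illustration and are not part of the example data in this answer.

```r
# Toy data for illustration only: a two-level label plus three binary features.
set.seed(1)
toy <- data.frame(
  label = rep(c("Benign", "Malware"), times = c(20, 30)),
  A = sample(0:1, 50, replace = TRUE),
  B = sample(0:1, 50, replace = TRUE),
  C = sample(0:1, 50, replace = TRUE)
)

# One chi-squared test per feature column against the label; keep only the p-value.
# sapply() returns a numeric vector named after the feature columns.
pvals <- sapply(toy[, -1], function(x) chisq.test(toy$label, x)$p.value)

# Smallest p-value first, i.e. strongest evidence of association with the label.
ranked <- sort(pvals)
ranked
```

A small p-value indicates evidence of an association rather than its strength, so the ordering is probably best treated as a rough screen and not as a definitive importance ranking.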
Reproducible example:
nvars=5; nrows=50
set.seed(123)
X <- data.frame(matrix(sample(c(0,1), size=nrows*nvars, replace=TRUE), ncol=nvars))
data <- data.frame(AppCategory=c(rep("Benign", 20), rep("Malware", 30)), X)
str(data)
'data.frame': 50 obs. of 6 variables:
$ AppCategory: Factor w/ 2 levels "Benign","Malware": 1 1 1 1 1 1 1 1 1 1 ...
$ X1 : num 0 0 0 1 0 1 1 1 0 0 ...
$ X2 : num 1 0 0 0 0 1 1 0 1 0 ...
$ X3 : num 0 1 1 0 1 1 0 0 0 1 ...
$ X4 : num 0 1 0 1 0 0 0 0 0 0 ...
$ X5 : num 1 1 1 0 1 1 1 0 1 1 ...
Upvotes: 4
Reputation: 182
First, review the details in the related question "Performing a chi-square test across multiple variables and extracting the relevant p-value in R". Then see the similar solution code below:
> # Assuming your dataframe is something like:
> x1 <- sample(1:7,5,replace = F)
> x2 <- sample(2:7,5,replace = T)
> x3 <- sample(1:6,5,replace = T)
> x4 <- sample(3:8,5,replace = T)
> y <- sample(1:100,5,replace = F)
> df <- data.frame(cbind(x1,x2,x3,x4,y))
> mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs = list(y = df[, 5]))
x1 x2 x3 x4
0.2202206 0.2202206 0.2872975 0.2414365
# Note: this is just a schema; you will need to adapt it and check the statistical assumptions for your own data.
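One such nuance, offered here as a suggestion and not as part of the original answer: when many variables are tested at once, the raw p-values are often adjusted for multiple comparisons before any are called significant. A minimal sketch using p.adjust from the stats package follows; the p-values in raw_p are made-up illustrative numbers, not results from the code above.

```r
# Made-up p-values for illustration, named by feature as mapply() would return them.
raw_p <- c(x1 = 0.004, x2 = 0.03, x3 = 0.20, x4 = 0.65)

# Holm's step-down correction controls the family-wise error rate.
adj_p <- p.adjust(raw_p, method = "holm")

# Names of the features still below the 5% level after adjustment.
significant <- names(adj_p)[adj_p < 0.05]
significant
```

Holm's method is only one option; method = "BH" would control the false discovery rate instead, which may suit a screening step better.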
Upvotes: 0