user3919790

Reputation: 557

Performing a chi-square test across multiple variables and extracting the relevant p-values in R

OK, straight to the question. I have a database with lots and lots of categorical variables.

A sample database with a few variables is below:

gender <- as.factor(sample(letters[6:7], 100, replace=TRUE, prob=c(0.2, 0.8)))
smoking <- as.factor(sample(c(0,1), size=100, replace=TRUE, prob=c(0.6,0.4)))
alcohol <- as.factor(sample(c(0,1), size=100, replace=TRUE, prob=c(0.3,0.7)))
htn <- as.factor(sample(c(0,1), size=100, replace=TRUE, prob=c(0.2,0.8)))
tertile <- as.factor(sample(c(1,2,3), size=100, replace=TRUE, prob=c(0.3,0.3,0.4)))
# data.frame() keeps the columns as factors; cbind() would coerce them to integer codes
df <- data.frame(gender, smoking, alcohol, htn, tertile)

I want to use a chi-square test to test the hypothesis that the proportion of smokers, alcohol users, people with hypertension (htn), etc. differs by tertile (a factor with 3 levels). I then want to extract the p-value for each variable.

Now, I know I can test each variable individually using a 2-by-3 cross-tabulation, but is there more efficient code to derive the test statistic and p-value for all variables in one go, and to extract the p-value for each variable?

Thanks in advance

Anoop

Upvotes: 4

Views: 15147

Answers (2)

Mehmet Yildirim

Reputation: 501

Run the following code chunk if you want the full test results for each variable:

lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))

To get just the p-values:

lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)

To get the p-values in a data frame:

data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))

Thanks to this RPubs post for the inspiration: http://www.rpubs.com/kaz_yos/1204
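Since the question also asks for the test statistic, here is a minimal sketch of one way to collect the statistic, degrees of freedom, and p-value for each variable in a single data frame. The names `predictors` and `results` are my own, the seed is fixed only so the sample data are reproducible, and I have left out `simulate.p.value = TRUE` so that the degrees of freedom are defined (with simulation, `chisq.test` reports them as NA).

```r
# Sketch: one row per variable with the chi-square statistic, df, and p-value.
set.seed(42)  # assumption: fixed seed only to make the sample data reproducible
df <- data.frame(
  smoking = factor(sample(c(0, 1), 100, replace = TRUE, prob = c(0.6, 0.4))),
  alcohol = factor(sample(c(0, 1), 100, replace = TRUE, prob = c(0.3, 0.7))),
  htn     = factor(sample(c(0, 1), 100, replace = TRUE, prob = c(0.2, 0.8))),
  tertile = factor(sample(1:3, 100, replace = TRUE, prob = c(0.3, 0.3, 0.4)))
)

# Every column except the grouping variable is tested against tertile
predictors <- setdiff(names(df), "tertile")

results <- do.call(rbind, lapply(predictors, function(v) {
  test <- chisq.test(table(df[[v]], df$tertile))
  data.frame(
    variable  = v,
    statistic = unname(test$statistic),  # X-squared
    df        = unname(test$parameter),  # degrees of freedom
    p.value   = test$p.value
  )
}))

print(results)
```

Selecting the predictors by name rather than with `df[,-5]` means the code does not break if the column order changes.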

Upvotes: 1

MrFlick

Reputation: 206586

If you want to do all the comparisons in one statement, you can use mapply:

mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
#    gender   smoking   alcohol       htn 
# 0.4967724 0.8251178 0.5008898 0.3775083 

Of course, running the tests this way raises a multiple-comparisons problem: you are performing several tests at once, so some correction is required to maintain an appropriate type I error rate.
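As a rough sketch of what such a correction could look like, base R's `p.adjust` can be applied to the vector of p-values. The values below are copied from the output above; the choice of the Holm and Benjamini-Hochberg methods is my assumption, not something the question specifies, and the right method depends on how you want to control the error rate.

```r
# p-values taken from the mapply() output above
p_values <- c(gender = 0.4967724, smoking = 0.8251178,
              alcohol = 0.5008898, htn = 0.3775083)

# Holm controls the family-wise error rate
p_holm <- p.adjust(p_values, method = "holm")

# Benjamini-Hochberg controls the false discovery rate
p_bh <- p.adjust(p_values, method = "BH")

# Raw and adjusted p-values side by side, one row per variable
print(cbind(raw = p_values, holm = p_holm, BH = p_bh))
```

In practice you would pass the result of the `mapply` call straight to `p.adjust` instead of typing the values in by hand.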

Upvotes: 3
