Reputation: 3
I want to analyse categorical data with a chi-squared test in R. I am working with transplant data and want to compare outcomes between patients who were on or off bypass during surgery. I asked a similar question before about my categorical variables and was given this answer to test for group differences by sex:
df <- read.table(text="Group, Age, Sex, Height, Weight, Diagnosis, Blood loss, Intubation time, Survival
On bypass,59,Male,165,102,Diagnosis 1,57,53,29
On bypass,44,Female,164,140,Diagnosis 1,114,15,35
On bypass,45,Male,165,119,Diagnosis 2,118,31,81
On bypass,26,Male,178,125,Diagnosis 1,171,36,31
On bypass,41,Female,177,105,Diagnosis 1,76,53,91
On bypass,43,Male,161,119,Diagnosis 3,97,38,63
Off bypass,53,Female,164,139,Diagnosis 1,125,49,51
Off bypass,26,Female,165,137,Diagnosis 3,29,7,86
Off bypass,30,Male,174,121,Diagnosis 1,174,43,100
Off bypass,59,Female,174,133,Diagnosis 1,40,16,43
Off bypass,63,Male,172,132,Diagnosis 2,32,46,10 ", header = TRUE, sep = ",")
library(dplyr)
# tally number of participants in each Group by Sex
tab <- tally(group_by(df, Group, Sex))
chisq.test(tab$n) # test for Group differences by Sex
I have used this to test for differences in categorical variables with two levels (such as sex, with the levels male and female). However, some of my variables have more than two levels, for example diagnosis (see my example data set below). For these variables I want to compare each diagnosis between the on/off bypass groups.
Here is my exampledata:
exampledata <- read.table(text="ID,Bypass,Sex,Age,Height,Weight,Diagnosis
559,Bypass on,Male,33,167,78,Other
662,Bypass off,Male,63,175,55,UIP
956,Bypass off,Female,40,158,88,Other
460,Bypass on,Female,34,173,86,UIP
153,Bypass off,Female,31,171,74,UIP
192,Bypass off,Male,33,163,64,Other
658,Bypass on,Male,50,161,60,Other
529,Bypass off,Female,55,179,75,Cystic fibrosis
981,Bypass on,Male,36,166,81,Other
367,Bypass on,Female,46,152,85,PH
728,Bypass off,Male,30,169,88,Other
185,Bypass on,Female,65,162,57,UIP
160,Bypass on,Male,54,176,62,PH
175,Bypass off,Male,29,156,78,Other
167,Bypass off,Male,20,175,86,PH
149,Bypass on,Male,24,169,82,Cystic fibrosis
446,Bypass off,Male,38,162,69,PH
667,Bypass on,Male,55,150,55,Cystic fibrosis
488,Bypass off,Female,41,162,56,Other
169,Bypass off,Female,60,154,55,Cystic fibrosis
787,Bypass on,Male,41,169,52,Cystic fibrosis
443,Bypass on,Male,35,159,77,Other
593,Bypass off,Female,28,167,53,Other
653,Bypass off,Female,22,176,75,Other
685,Bypass off,Male,26,170,88,Cystic fibrosis
676,Bypass on,Male,32,172,58,Cystic fibrosis
556,Bypass off,Male,26,168,88,PH
943,Bypass off,Male,40,176,80,PH
940,Bypass off,Male,37,180,69,Cystic fibrosis
740,Bypass on,Female,58,153,72,UIP
624,Bypass on,Female,40,156,81,UIP
194,Bypass on,Male,33,155,60,PH
162,Bypass on,Female,23,170,64,PH
283,Bypass off,Male,60,180,61,Other
404,Bypass on,Male,26,170,63,PH
312,Bypass on,Male,36,171,83,PH
995,Bypass on,Female,48,161,67,Other
254,Bypass on,Female,35,175,62,UIP
364,Bypass on,Female,65,161,55,UIP
771,Bypass off,Male,37,157,72,Other
698,Bypass on,Male,31,163,87,PH
286,Bypass on,Female,60,154,80,UIP
189,Bypass off,Male,42,168,57,PH
463,Bypass on,Female,32,176,50,PH
634,Bypass off,Male,53,152,64,UIP
198,Bypass off,Female,20,171,70,Cystic fibrosis
356,Bypass off,Male,55,161,72,Cystic fibrosis
254,Bypass on,Female,49,169,61,UIP
921,Bypass on,Male,47,152,63,UIP
185,Bypass on,Male,63,174,71,Other
953,Bypass on,Male,32,169,63,PH
336,Bypass on,Female,33,164,52,Other
651,Bypass off,Female,55,172,54,PH
200,Bypass off,Male,43,179,55,UIP
625,Bypass off,Male,43,158,75,Other
986,Bypass on,Female,32,151,81,Other
437,Bypass off,Female,53,152,57,Other
433,Bypass on,Male,35,180,74,Cystic fibrosis
673,Bypass on,Female,27,159,58,Cystic fibrosis
901,Bypass off,Male,30,169,72,PH", header = TRUE, sep = ",")
I am using this to create a table of counts:
mytable <- table(exampledata$Bypass,exampledata$Diagnosis)
returns
             Cystic fibrosis Other PH UIP
  Bypass off               6    11  7   4
  Bypass on                6     8  9   9
However, as I wish to look at each diagnosis individually, the output I require is:
             Cystic fibrosis Not Cystic fibrosis
  Bypass off               6                  22
  Bypass on                6                  26
I am hoping that using this output I can compare the number of patients that have Cystic fibrosis in the on/off pump groups.
Ideally I would then be able to quickly repeat this for each diagnosis.
If someone believes there is a better way of doing this (or I am just doing it the wrong way) then please advise.
Any help would be much appreciated.
Thanks, Tom
Upvotes: 0
Views: 169
Reputation: 10473
You can do something like this:
mytable <- table(exampledata$Bypass, exampledata$Diagnosis == 'Cystic fibrosis')
colnames(mytable) <- c('Not Cystic fibrosis', 'Cystic fibrosis')
             Not Cystic fibrosis Cystic fibrosis
  Bypass off                  22               6
  Bypass on                   26               6
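If the column order matters (the question lists the diagnosis column first), one possible variation is to convert the comparison to a factor with explicit levels. This is only a sketch: `toy`, `diagnoses`, `is_cf` and `cf_table` are illustrative names that do not appear in the question, and `toy` is rebuilt from the counts in the question's table rather than read from the real data.

```r
# Rebuild Bypass/Diagnosis columns from the counts in the question's table
# (a stand-in for exampledata, so the snippet runs on its own).
diagnoses <- c("Cystic fibrosis", "Other", "PH", "UIP")
toy <- data.frame(
  Bypass = rep(c("Bypass off", "Bypass on"), times = c(28, 32)),
  Diagnosis = c(rep(diagnoses, times = c(6, 11, 7, 4)),
                rep(diagnoses, times = c(6, 8, 9, 9)))
)

# Turn the logical comparison into a factor whose first level is TRUE,
# so the diagnosis column comes before the "Not ..." column.
is_cf <- factor(toy$Diagnosis == "Cystic fibrosis",
                levels = c(TRUE, FALSE),
                labels = c("Cystic fibrosis", "Not Cystic fibrosis"))

cf_table <- table(toy$Bypass, is_cf)
cf_table
```

With the real data, `toy` would simply be replaced by `exampledata`; the factor levels control the column order, so no `colnames<-` call is needed.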
If you want this same thing done for all categories, you can do this in a function / loop.
EDIT: adding a loop option to get all the tables needed:
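# levels() returns NULL when Diagnosis is a character column (the read.table
# default since R 4.0.0), so take the sorted unique values instead
lapply(sort(unique(as.character(exampledata$Diagnosis))), function(x) {
  # 2x2 table: Bypass on/off against "is this diagnosis" (FALSE, then TRUE)
  mytable <- table(exampledata$Bypass, exampledata$Diagnosis == x)
  colnames(mytable) <- c(paste('Not ', x, sep = ''), x)
  mytable
})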
Output is as follows:
[[1]]
             Not Cystic fibrosis Cystic fibrosis
  Bypass off                  22               6
  Bypass on                   26               6

[[2]]
             Not Other Other
  Bypass off        17    11
  Bypass on         24     8

[[3]]
             Not PH PH
  Bypass off     21  7
  Bypass on      23  9

[[4]]
             Not UIP UIP
  Bypass off      24   4
  Bypass on       23   9
To run chi-square tests on each of the above tables, save the output of the lapply call above to a variable, say l. Then use:
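lapply(l, chisq.test)
The output is a list of four test results, one per diagnosis. (lapply is used rather than sapply because sapply would try to simplify the results into a matrix.)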
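To compare the results side by side, the p-values can be pulled out of the list of test results. The following is a sketch under assumptions: the 2x2 tables are rebuilt from the counts printed above so that it runs on its own, and `make_table`, `tests` and `p_values` are illustrative names, not part of the original code.

```r
# Helper (hypothetical) that rebuilds one 2x2 table from the printed counts.
# off/on are c(count without diagnosis, count with diagnosis).
make_table <- function(name, off, on) {
  matrix(c(off, on), nrow = 2, byrow = TRUE,
         dimnames = list(c("Bypass off", "Bypass on"),
                         c(paste("Not", name), name)))
}

# Named list standing in for the saved lapply output `l`
l <- list(
  "Cystic fibrosis" = make_table("Cystic fibrosis", c(22, 6),  c(26, 6)),
  "Other"           = make_table("Other",           c(17, 11), c(24, 8)),
  "PH"              = make_table("PH",              c(21, 7),  c(23, 9)),
  "UIP"             = make_table("UIP",             c(24, 4),  c(23, 9))
)

# One chisq.test result per diagnosis, then one p-value per result
tests <- lapply(l, chisq.test)
p_values <- sapply(tests, function(test) test$p.value)
p_values
```

Naming the list elements by diagnosis means the resulting vector is labelled, which makes it easier to see which p-value belongs to which table.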
Of course, once you save the lapply output to a list l, you can also run individual chi-square tests like:
chisq.test(l[[1]])
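Since the question also asks whether there is a better approach, one option, offered here as a suggestion rather than as part of the method above, is to start with a single chi-square test on the full 2 x 4 table before looking at each diagnosis separately. The sketch below enters the counts from the question's table by hand; `counts` and `overall` are illustrative names.

```r
# Full Bypass x Diagnosis table, typed in from the counts in the question
counts <- matrix(c(6, 11, 7, 4,
                   6,  8, 9, 9),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(
                   Bypass = c("Bypass off", "Bypass on"),
                   Diagnosis = c("Cystic fibrosis", "Other", "PH", "UIP")))

# One test of independence across all four diagnoses at once
overall <- chisq.test(counts)

overall$expected  # expected counts under independence
overall$p.value   # p-value of the overall test
```

With the real data, `chisq.test(table(exampledata$Bypass, exampledata$Diagnosis))` would run the same test. Inspecting `overall$expected` is a quick way to check whether the expected counts are large enough for the chi-square approximation to be reasonable.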
Upvotes: 1