Reputation: 4206
I have a data.frame containing survey data on three binary variables. The data is already in a contingency table, with the first three columns being the answers (1 = yes, 0 = no) and the fourth column showing the total number of answers. The rows are four different groups.
My aim is to calculate z-scores to check whether the proportions in each group differ significantly from the overall proportion.
This is my data:
library(dplyr)  # loading libraries
df <- structure(list(var1 = c(416, 1300, 479, 417),
                     var2 = c(265, 925, 473, 279),
                     var3 = c(340, 1013, 344, 284),
                     totalN = c(1366, 4311, 1904, 1233)),
                class = "data.frame",
                row.names = c(NA, -4L),
                .Names = c("var1", "var2", "var3", "totalN"))
And these are my totals:
dfTotal <- df %>% summarise_all(funs(sum(., na.rm=TRUE)))
dfTotal
dfTotal <- data.frame(dfTotal)
rownames(dfTotal) <- "Total"
To calculate the z-score I use the following function:
zScore <- function(cntA, totA, cntB, totB) {
  # pooled proportion across both samples
  avgProportion <- (cntA + cntB) / (totA + totB)
  probA <- cntA / totA
  probB <- cntB / totB
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1 / totA + 1 / totB))
  zScore <- (probA - probB) / SE
  return(zScore)
}
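For reference, the function above appears to correspond to the pooled two-proportion z statistic. The symbols below are introduced here only for illustration: $x_A, n_A$ stand for `cntA`, `totA`, and $x_B, n_B$ stand for `cntB`, `totB`.

```latex
\hat{p} = \frac{x_A + x_B}{n_A + n_B},
\qquad
z = \frac{\dfrac{x_A}{n_A} - \dfrac{x_B}{n_B}}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}}
```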
Is there a way, using dplyr, to calculate a 4x3 matrix that holds, for all four groups and the variables var1 to var3, the z-test value against the total proportion?
I am currently stuck with this bit of code:
df %>% mutate_all(funs(zScore(., totalN, dfTotal$var1, dfTotal$totalN)))
The parameters dfTotal$var1 and dfTotal$totalN used here don't work, and I have no idea how to feed them into the function. The total count should not always be var1; it should be var2, var3 (and totalN) to match the column passed as the first parameter.
Upvotes: 0
Views: 6729
Reputation: 24178
If you want to use your zScore function inside a dplyr pipeline, we'll need to tidy your data first and add new variables containing the values you now have in dfTotal:
library(dplyr)
library(tidyr)
# add grouping variables we'll need further down
df %>% mutate(group = 1:4) %>%
  # reshape data to long format
  gather(question, count, -group, -totalN) %>%
  # add totals by question to df
  group_by(question) %>%
  mutate(answers = sum(totalN),
         yes = sum(count)) %>%
  # calculate z-scores by group against total
  group_by(group, question) %>%
  summarise(z_score = zScore(count, totalN, yes, answers)) %>%
  # spread to wide format
  spread(question, z_score)
## A tibble: 4 x 4
# group var1 var2 var3
#* <int> <dbl> <dbl> <dbl>
#1 1 0.6162943 -2.1978303 1.979278
#2 2 0.6125615 -0.7505797 1.311001
#3 3 -3.9106430 2.6607258 -4.232391
#4 4 2.9995381 0.4712734 0.438899
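If a plain matrix is enough, the same numbers can probably be obtained without reshaping. The following is a minimal base R sketch, not part of the original answer: it relies on zScore being vectorised over its arguments, and the names `vars` and `z_matrix` are made up for illustration. The data and the function are repeated so that the block runs on its own.

```r
# Data from the question
df <- data.frame(
  var1   = c(416, 1300, 479, 417),
  var2   = c(265, 925, 473, 279),
  var3   = c(340, 1013, 344, 284),
  totalN = c(1366, 4311, 1904, 1233)
)

# Pooled two-proportion z statistic, as defined in the question
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1 / totA + 1 / totB))
  (cntA / totA - cntB / totB) / SE
}

# Hypothetical helper names: one column of z-scores per variable
vars <- c("var1", "var2", "var3")
z_matrix <- sapply(vars, function(v) {
  # each group's count vs. the column total, against the overall N
  zScore(df[[v]], df$totalN, sum(df[[v]]), sum(df$totalN))
})
z_matrix
```

Because `sapply` is called on a named character vector, the result is a 4x3 numeric matrix with columns var1 to var3, which should match the tibble above row by row.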
Upvotes: 1
Reputation: 16277
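Standardised z-scores in R are computed with scale, which centres each column on its mean and divides by its standard deviation (note that this is column-wise standardisation, not the two-proportion z-test described in the question):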
scale(df)
var1 var2 var3 totalN
[1,] -0.5481814 -0.71592544 -0.4483732 -0.5837722
[2,] 1.4965122 1.42698064 1.4952995 1.4690147
[3,] -0.4024623 -0.04058534 -0.4368209 -0.2087639
[4,] -0.5458684 -0.67046986 -0.6101053 -0.6764787
If you want only the three var columns:
scale(df[,1:3])
var1 var2 var3
[1,] -0.5481814 -0.71592544 -0.4483732
[2,] 1.4965122 1.42698064 1.4952995
[3,] -0.4024623 -0.04058534 -0.4368209
[4,] -0.5458684 -0.67046986 -0.6101053
Upvotes: 4