Jan

Reputation: 4206

Create a matrix of z-scores in R

I have a data.frame containing survey data on three binary variables. The data is already aggregated into a contingency table: the first three columns hold the answer counts for each variable (1 = yes, 0 = no), and the fourth column holds the total number of answers. The rows are four different groups.

My aim is to calculate z-scores to check whether each group's proportions differ significantly from the total.

This is my data:

library(dplyr) #loading libraries
df <- structure(list(var1 = c(416, 1300, 479, 417), 
                     var2 = c(265, 925,473, 279),
                     var3 = c(340, 1013, 344, 284),
                     totalN = c(1366, 4311,1904, 1233)),
                class = "data.frame",
                row.names = c(NA, -4L),
                .Names = c("var1","var2", "var3", "totalN"))

And these are my total values:

dfTotal <- df %>% summarise_all(sum, na.rm = TRUE)  # column sums over all groups
dfTotal
dfTotal <- data.frame(dfTotal)
rownames(dfTotal) <- "Total"

To calculate the z-score I use the following function:

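zScore <- function (cntA, totA, cntB, totB) {
  # pooled proportion across both samples
  avgProportion <- (cntA + cntB) / (totA + totB)
  # observed proportions in sample A and sample B
  probA <- cntA/totA
  probB <- cntB/totB
  # standard error of the difference, based on the pooled proportion
  SE <- sqrt(avgProportion * (1-avgProportion)*(1/totA + 1/totB))
  # two-proportion z statistic
  (probA-probB) / SE
}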
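
As a quick sanity check of the formula, a single call for the first group on var1 against the overall totals could look like the sketch below. The function is repeated so the snippet runs on its own; the constants come from df (2612 is the column sum of var1, 8814 the sum of totalN), and the name z11 is only illustrative.

```r
# Same two-proportion z-score as above, repeated to keep the snippet self-contained
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1 / totA + 1 / totB))
  (cntA / totA - cntB / totB) / SE
}

# group 1, var1: 416 of 1366 "yes" answers, compared with 2612 of 8814 overall
z11 <- zScore(cntA = 416, totA = 1366, cntB = 2612, totB = 8814)
z11
```

The result is roughly 0.62, so for this group and variable the proportion is close to the overall one.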

Is there a way, using dplyr, to calculate a 4x3 matrix that holds, for all four groups and the variables var1 to var3, the z-test value against the total proportion?

I am currently stuck with this bit of code:

df %>% mutate_all(funs(zScore(., totalN, dfTotal$var1, dfTotal$totalN)))

The third and fourth arguments used here, dfTotal$var1 and dfTotal$totalN, don't work, and I have no idea how to feed the right values into the function. The third argument should not always be var1; it should be var2, var3 (and totalN) so that it matches the column passed as the first argument.

Upvotes: 0

Views: 6729

Answers (2)

mtoto

Reputation: 24178

If you want to use your zScore function inside a dplyr pipeline, we'll need to tidy your data first and add new variables containing the values you now have in dfTotal:

library(dplyr)
library(tidyr)

df %>%
  # add a grouping variable we'll need further down
  mutate(group = 1:4) %>%
  # reshape data to long format
  gather(question, count, -group, -totalN) %>%
  # add totals by question to df
  group_by(question) %>%
  mutate(answers = sum(totalN),
         yes = sum(count)) %>%
  # calculate z-scores by group against total
  group_by(group, question) %>%
  summarise(z_score = zScore(count, totalN, yes, answers)) %>%
  # spread to wide format
  spread(question, z_score)
## A tibble: 4 x 4
#  group       var1       var2      var3
#* <int>      <dbl>      <dbl>     <dbl>
#1     1  0.6162943 -2.1978303  1.979278
#2     2  0.6125615 -0.7505797  1.311001
#3     3 -3.9106430  2.6607258 -4.232391
#4     4  2.9995381  0.4712734  0.438899
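
For comparison, the same 4x3 matrix can also be sketched in base R without reshaping the data. This is an alternative to the pipeline above, not part of it. It assumes df and the zScore formula from the question (both repeated here so the snippet runs on its own); the names vars and zMatrix are made up for illustration.

```r
# Two-proportion z-score with a pooled standard error (formula from the question)
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1 / totA + 1 / totB))
  (cntA / totA - cntB / totB) / SE
}

# data from the question
df <- data.frame(var1   = c(416, 1300, 479, 417),
                 var2   = c(265, 925, 473, 279),
                 var3   = c(340, 1013, 344, 284),
                 totalN = c(1366, 4311, 1904, 1233))

vars <- c("var1", "var2", "var3")

# For each variable, compare every group (vectorised over rows)
# with the column total; sapply binds the results into a 4x3 matrix.
zMatrix <- sapply(vars, function(v) {
  zScore(df[[v]], df$totalN, sum(df[[v]]), sum(df$totalN))
})
zMatrix
```

Because zScore is vectorised, each call returns one value per group, and the column names of zMatrix are taken from vars.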

Upvotes: 1

Pierre Lapointe

Reputation: 16277

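Z-scores in the sense of column-wise standardization (subtract the column mean, divide by the column standard deviation) are handled in R with scale. Note that this standardizes the raw counts in each column; it is not the two-proportion z-test against the total described in the question: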

scale(df)
           var1        var2       var3     totalN
[1,] -0.5481814 -0.71592544 -0.4483732 -0.5837722
[2,]  1.4965122  1.42698064  1.4952995  1.4690147
[3,] -0.4024623 -0.04058534 -0.4368209 -0.2087639
[4,] -0.5458684 -0.67046986 -0.6101053 -0.6764787

If you want only the three var columns:

scale(df[,1:3])
           var1        var2       var3
[1,] -0.5481814 -0.71592544 -0.4483732
[2,]  1.4965122  1.42698064  1.4952995
[3,] -0.4024623 -0.04058534 -0.4368209
[4,] -0.5458684 -0.67046986 -0.6101053
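
To make explicit what scale computes here, the following sketch reproduces it by hand with mean and sd for each column. It assumes the df from the question; the names scaled and manual are only illustrative.

```r
# data from the question
df <- data.frame(var1   = c(416, 1300, 479, 417),
                 var2   = c(265, 925, 473, 279),
                 var3   = c(340, 1013, 344, 284),
                 totalN = c(1366, 4311, 1904, 1233))

# built-in standardization of the three count columns
scaled <- scale(df[, 1:3])

# the same thing by hand: centre each column on its mean, divide by its sd
manual <- sapply(df[, 1:3], function(x) (x - mean(x)) / sd(x))

# TRUE if both approaches agree (attributes dropped before comparing)
isTRUE(all.equal(as.vector(scaled), as.vector(manual)))
```

Each value therefore says how many standard deviations a group's raw count lies from the mean count of that column; totalN plays no role in it.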

Upvotes: 4
