jonandet

Reputation: 33

Create z-scores for multiple columns based on mean and SD in controls (R)

I have a large data set with cases and controls and multiple scores that I want to normalize based on the mean and standard deviation in the controls. I want the z-scores to be added to the data frame, each named after the original variable with a "z" in front.

A sample of the data can be created with the following code (however the full data set contains several more variables):

df <- structure(
  list(
    diagnosis = structure(c(1L, 1L, 3L, 2L, 2L, 3L, 2L, 1L, 2L, 1L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 3L, 3L),
                                    levels = c("control", "disease1", "disease2"), class = "factor"), 
    ly = c(-0.00121860465, -0.00135562539, -0.000320672533, -0.00121297926, -0.00149409177, -0.000911659334, -0.000955248272, -0.000965053426, -0.00118177247, -0.0011801697, -0.00146885607, -0.000572099309, -0.000875518542, -0.000847725959, -0.00133010695, -0.000926367067, -0.000976587052, -0.000998975487, -0.00177083595, -0.00170147706), 
    as = c(-0.0892422222, -0.0927111112, -0.127388889, -0.113477778, -0.127388880, -0.0827111111, -0.111044444, -0.102133333, -0.106588889, -0.105422222, -0.0789352352, -0.0927111111, -0.102133333, -0.127388889, -0.137388889, -0.127727778, -0.127388889, -0.116044445, -0.116044444, -0.1205239847), 
    ad = c(-0.00126427049, -0.00115668877, -0.00113243494, -0.00117873892, -0.00109579533, -0.00116917575, -0.00126072796, -0.00123688619, -0.00103635843, -0.00119274793, -0.00117878182, -0.00112272946, -0.00126483324, -0.00107161339, -0.00114657463, -0.00119816341, -0.00118735667, -0.00113913769, -0.00113432402, -0.00115775817)), row.names = c(NA, 20L), class = "data.frame")

So far I have created the scores one by one with the following code, which gives the desired output:

df$zly<-(df$ly-mean(df$ly[df$diagnosis=="control"]))/sd(df$ly[df$diagnosis=="control"])
df$zas<-(df$as-mean(df$as[df$diagnosis=="control"]))/sd(df$as[df$diagnosis=="control"])
df$zad<-(df$ad-mean(df$ad[df$diagnosis=="control"]))/sd(df$ad[df$diagnosis=="control"])

However, I have to create more than 20 z-scores in more than 4 different data frames and would like to make the code more efficient. Any suggestions?

Upvotes: 0

Views: 673

Answers (1)

Limey

Reputation: 12461

Here's a solution based on the tidyverse:

library(tidyverse)

myDf <- df %>% 
  mutate(
    across(
      -diagnosis, 
      function(x) {
        t <- df %>% filter(diagnosis == "control")
        m <- t %>%
          summarise(mean=mean(t[[cur_column()]]), sd=sd(t[[cur_column()]]))
        (x - m$mean)/m$sd
      }, 
      .names="z{col}"
    )
  )
myDf
   diagnosis            ly          as           ad        zly         zas         zad
1    control -0.0012186047 -0.08924222 -0.001264270 -0.4548613  0.89016060 -1.37152284
2    control -0.0013556254 -0.09271111 -0.001156689 -1.0705367  0.70986278  0.50913743
3   disease2 -0.0003206725 -0.12738889 -0.001132435  3.5798178 -1.09253780  0.93312409
<output truncated>

As a credibility check, look to see if the means and SDs of the z variables in the control group are 0 and 1 respectively.

myDf %>% 
  group_by(diagnosis) %>% 
  summarise(across(starts_with("z"), c("mean"=mean, "sd"=sd)))
# A tibble: 3 × 7
  diagnosis  zly_mean zly_sd  zas_mean zas_sd zad_mean zad_sd
  <fct>         <dbl>  <dbl>     <dbl>  <dbl>    <dbl>  <dbl>
1 control    4.51e-16  1      1.08e-16  1     1.43e-15  1    
2 disease1  -4.21e- 1  0.993 -4.29e- 1  0.466 7.50e- 1  1.71 
3 disease2   2.57e- 1  2.30  -2.86e- 1  0.945 4.80e- 1  0.795

All OK

To apply the transformation to different data frames, wrap it in a function:

myFunc <- function(data) {
  data %>% 
    mutate(
      across(
        -diagnosis, 
        function(x) {
          # Take the control rows from the data frame passed in, not from the global df
          t <- data %>% filter(diagnosis == "control")
          m <- t %>%
            summarise(mean=mean(t[[cur_column()]]), sd=sd(t[[cur_column()]]))
          (x - m$mean)/m$sd
        }, 
        .names="z{col}"
      )
    )
}

df %>% myFunc()

This gives the same output as above.
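Since the question mentions several data frames, one way to run the same transformation over all of them is to put them in a list and use lapply. The following is only a minimal sketch under stated assumptions: z_by_control, its group_col and control parameters, and the toy data frames df_a and df_b are hypothetical names introduced here for illustration. It also differs from the function above in two ways: it selects columns with where(is.numeric) instead of -diagnosis, and it computes the control-row index once up front.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): z-score every numeric
# column against the mean and SD of the control rows of the same data frame.
z_by_control <- function(data, group_col = "diagnosis", control = "control") {
  # Logical index of the control rows, computed once per data frame
  is_ctrl <- data[[group_col]] == control
  data %>%
    mutate(
      across(
        where(is.numeric),
        function(x) (x - mean(x[is_ctrl])) / sd(x[is_ctrl]),
        .names = "z{.col}"
      )
    )
}

# Two toy data frames standing in for the real ones (made-up values)
df_a <- data.frame(
  diagnosis = c("control", "control", "control", "disease1"),
  ly = c(1, 2, 3, 5)
)
df_b <- data.frame(
  diagnosis = c("control", "control", "disease2", "control"),
  as = c(10, 20, 40, 30)
)

# Apply the helper to every data frame in a named list
results <- lapply(list(a = df_a, b = df_b), z_by_control)

results$a$zly  # -1  0  1  3
results$b$zas  # -1  0  2  1
```

Each element of results is the corresponding input data frame with the z columns appended, so results$a and results$b can be used in place of the originals.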

across applies a function (given by its second argument, .fns) to the columns specified by its first argument (.cols). The name(s) of the derived column(s) are given by the .names argument, which should be a glue specification. If .names is omitted and a single function is supplied, the input columns are overwritten.
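The effect of the naming argument can be seen on a small made-up example. The tibble toy and the doubling function are purely illustrative and not part of the original data.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# Without .names: columns a and b are replaced by the transformed values
overwritten <- toy %>%
  mutate(across(c(a, b), function(x) x * 2))

# With .names: the originals are kept and new columns za and zb are appended
appended <- toy %>%
  mutate(across(c(a, b), function(x) x * 2, .names = "z{.col}"))

names(overwritten)  # "a" "b"
names(appended)     # "a" "b" "za" "zb"
```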

Edit summary: There was an error in my original answer. I incorrectly handled the filtering of the input dataset when calculating the mean and SD of the control group.

Upvotes: 3
