Reputation: 33
I have a large data set with cases and controls and multiple scores that I want to normalize based on the mean and standard deviation in the controls. I want the z-scores to be added to the data frame, each named after the original variable with a "z" in front.
A sample of the data can be created with the following code (however the full data set contains several more variables):
df <- structure(
list(
diagnosis = structure(c(1L, 1L, 3L, 2L, 2L, 3L, 2L, 1L, 2L, 1L, 1L, 3L, 3L, 1L, 3L, 1L, 1L, 3L, 3L, 3L),
levels = c("control", "disease1", "disease2"), class = "factor"),
ly = c(-0.00121860465, -0.00135562539, -0.000320672533, -0.00121297926, -0.00149409177, -0.000911659334, -0.000955248272, -0.000965053426, -0.00118177247, -0.0011801697, -0.00146885607, -0.000572099309, -0.000875518542, -0.000847725959, -0.00133010695, -0.000926367067, -0.000976587052, -0.000998975487, -0.00177083595, -0.00170147706),
as = c(-0.0892422222, -0.0927111112, -0.127388889, -0.113477778, -0.127388880, -0.0827111111, -0.111044444, -0.102133333, -0.106588889, -0.105422222, -0.0789352352, -0.0927111111, -0.102133333, -0.127388889, -0.137388889, -0.127727778, -0.127388889, -0.116044445, -0.116044444, -0.1205239847),
ad = c(-0.00126427049, -0.00115668877, -0.00113243494, -0.00117873892, -0.00109579533, -0.00116917575, -0.00126072796, -0.00123688619, -0.00103635843, -0.00119274793, -0.00117878182, -0.00112272946, -0.00126483324, -0.00107161339, -0.00114657463, -0.00119816341, -0.00118735667, -0.00113913769, -0.00113432402, -0.00115775817)), row.names = c(NA, 20L), class = "data.frame")
So far I have created the scores one by one with the following code, which gives the desired output:
df$zly<-(df$ly-mean(df$ly[df$diagnosis=="control"]))/sd(df$ly[df$diagnosis=="control"])
df$zas<-(df$as-mean(df$as[df$diagnosis=="control"]))/sd(df$as[df$diagnosis=="control"])
df$zad<-(df$ad-mean(df$ad[df$diagnosis=="control"]))/sd(df$ad[df$diagnosis=="control"])
However, I have to create more than 20 z-scores in more than 4 different data frames and would like to make the code more efficient. Any suggestions?
Upvotes: 0
Views: 673
Reputation: 12461
Here's a solution based on the tidyverse
library(tidyverse)
myDf <- df %>%
  mutate(
    across(
      -diagnosis,
      function(x) {
        t <- df %>% filter(diagnosis == "control")
        m <- t %>%
          summarise(mean=mean(t[[cur_column()]]), sd=sd(t[[cur_column()]]))
        (x - m$mean)/m$sd
      },
      .names="z{col}"
    )
  )
myDf
diagnosis ly as ad zly zas zad
1 control -0.0012186047 -0.08924222 -0.001264270 -0.4548613 0.89016060 -1.37152284
2 control -0.0013556254 -0.09271111 -0.001156689 -1.0705367 0.70986278 0.50913743
3 disease2 -0.0003206725 -0.12738889 -0.001132435 3.5798178 -1.09253780 0.93312409
<output truncated>
As a credibility check, look to see if the means and SDs of the z
variables in the control group are 0 and 1 respectively.
myDf %>%
group_by(diagnosis) %>%
summarise(across(starts_with("z"), c("mean"=mean, "sd"=sd)))
# A tibble: 3 × 7
diagnosis zly_mean zly_sd zas_mean zas_sd zad_mean zad_sd
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 control 4.51e-16 1 1.08e-16 1 1.43e-15 1
2 disease1 -4.21e- 1 0.993 -4.29e- 1 0.466 7.50e- 1 1.71
3 disease2 2.57e- 1 2.30 -2.86e- 1 0.945 4.80e- 1 0.795
All OK: in the control group the means are zero up to floating-point error and the SDs are 1.
To apply the transformation to different data frames, wrap it in a function:
myFunc <- function(data) {
  data %>%
    mutate(
      across(
        -diagnosis,
        function(x) {
          # Take the controls from the data frame passed in, not from the global df
          t <- data %>% filter(diagnosis == "control")
          m <- t %>%
            summarise(mean=mean(t[[cur_column()]]), sd=sd(t[[cur_column()]]))
          (x - m$mean)/m$sd
        },
        .names="z{col}"
      )
    )
}
df %>% myFunc()
This gives the same output as above.
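Since the question mentions several data frames, one option is to keep them in a list and apply a single function to each. The following is only a base-R sketch of that idea, not the tidyverse approach used in this answer: add_control_z, df_a and df_b are hypothetical names with made-up data, and the helper assumes every numeric column is a score and that the grouping column is called diagnosis.

```r
# Hypothetical helper: add a control-based z-score for every numeric column.
# Assumes the grouping column is a factor or character, so it is skipped.
add_control_z <- function(data, group = "diagnosis", ref = "control") {
  is_ref <- data[[group]] == ref
  score_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  for (col in score_cols) {
    ref_values <- data[[col]][is_ref]
    # New column is named "z" + original name, e.g. ly -> zly
    data[[paste0("z", col)]] <- (data[[col]] - mean(ref_values)) / sd(ref_values)
  }
  data
}

# Made-up example data frames, for illustration only
df_a <- data.frame(
  diagnosis = factor(c("control", "control", "control", "disease1", "disease2")),
  ly = c(1, 2, 3, 4, 5),
  as = c(10, 20, 30, 40, 50)
)
df_b <- transform(df_a, ly = ly * 2)

# Apply the same transformation to every data frame in the list
scored <- lapply(list(a = df_a, b = df_b), add_control_z)
names(scored$a)
```

Each element of scored is the corresponding input data frame with the extra z columns appended, so the original objects are left unchanged.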
across() applies a function (given by its second argument) to the columns specified by its first argument. The name(s) of the derived column(s) are given by the .names argument (the third argument here), which should be a glue specification. If .names is omitted, the input columns are overwritten.
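The effect of the naming argument can be seen on a small example. This is a minimal sketch on a made-up data frame (toy, kept and overwritten are hypothetical names), assuming dplyr is installed. It uses the {.col} spelling documented for across().

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(g = c("a", "a", "b"), u = c(1, 2, 3), v = c(10, 20, 30))

# With .names: u and v are kept and new columns zu and zv are added
kept <- toy %>%
  mutate(across(c(u, v), function(x) x * 2, .names = "z{.col}"))
names(kept)

# Without .names: u and v are overwritten in place
overwritten <- toy %>%
  mutate(across(c(u, v), function(x) x * 2))
names(overwritten)
```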
Edit summary: There was an error in my original answer. I incorrectly handled the filtering of the input dataset when calculating the mean and SD of the control group.
Upvotes: 3