hugh_man

Reputation: 399

Automate z-score calculation by group

I have the following data frame:

df<- splitstackshape::stratified(iris, group="Species", size=1)

I want to compute z-scores within each species, using all four measurement variables together. I can do this manually by finding the SD and mean for each row and applying the usual formula, but I need to do this several times over and would like to find a more efficient way.

I tried using scale(), but can't figure out how to get it to do the row-wise calculation that includes several variables and a grouping variable.

Using dplyr::group_by returns a "'x' must be numeric variable" error.

Upvotes: 1

Views: 699

Answers (1)

liiiiiin

Reputation: 129

Are you sure you want one z-score per group? A z-score is defined for each value, relative to the mean and SD of the group it belongs to.

Let's say the function used to compute the z-score is either:

scale(x, center = TRUE, scale = TRUE)

Or

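function_zscore <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # NAs are ignored in the mean and SD

With both functions, if the argument x is a vector, you get one z-score per element. Note that scale() returns a one-column matrix, which is why [, 1] is used below to extract a plain vector.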
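As a quick check of that equivalence, here is a minimal sketch on a hand-typed vector. The four values and the names `z_manual` and `z_scale` are only illustrative and are not part of the original answer.

```r
# Illustrative vector: four measurements from a single flower
x <- c(4.9, 3.1, 1.5, 0.2)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so [, 1] extracts a plain numeric vector
z_scale <- scale(x, center = TRUE, scale = TRUE)[, 1]

print(round(z_manual, 3))
print(all.equal(z_manual, z_scale))
```

Both approaches use the sample standard deviation (denominator n - 1), so they should agree up to floating-point error.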

library(dplyr)  # provides %>%, group_by(), mutate() and summarise()

df <- splitstackshape::stratified(iris, group = "Species", size = 1)

df <- tidyr::pivot_longer(df, cols = c(1:4), names_to = "var.name", values_to = "value")

df %>% 
  group_by(Species) %>% 
  mutate(zscore = scale(value, center = TRUE, scale = TRUE)[,1])

## A tibble: 12 x 4
## Groups:   Species [3]
#   Species    var.name     value zscore
#   <fct>      <chr>        <dbl>  <dbl>
# 1 setosa     Sepal.Length   4.9  1.22 
# 2 setosa     Sepal.Width    3.1  0.332
# 3 setosa     Petal.Length   1.5 -0.455
# 4 setosa     Petal.Width    0.2 -1.09 
# 5 versicolor Sepal.Length   5.9  1.10 
# 6 versicolor Sepal.Width    3.2 -0.403
# 7 versicolor Petal.Length   4.8  0.486
# 8 versicolor Petal.Width    1.8 -1.18 
# 9 virginica  Sepal.Length   6.5  1.14 
#10 virginica  Sepal.Width    3   -0.574
#11 virginica  Petal.Length   5.2  0.501
#12 virginica  Petal.Width    2   -1.06 
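If you would rather keep the data in wide format, the same row-wise z-scores can probably be obtained in base R by transposing, scaling, and transposing back. This is only a sketch under a few assumptions: it picks three fixed rows of `iris` (one per species) instead of a random stratified sample so the result is reproducible, and the names `rows`, `m`, `z` and `result` are illustrative.

```r
# One fixed row per species (rows 1, 51 and 101 of iris), chosen for reproducibility
rows <- iris[c(1, 51, 101), ]

# Numeric part only: scale() fails on the factor column Species
m <- as.matrix(rows[, 1:4])

# scale() standardises columns, so transpose to treat each flower as a column,
# then transpose back to get one row of z-scores per flower
z <- t(scale(t(m)))

# Reattach the grouping variable
result <- data.frame(Species = rows$Species, z)
print(result)
```

Each row of `z` then has mean 0 and standard deviation 1, which matches what the grouped `mutate()` computes on the long data when there is one row per species.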

If you still want a single score per group that describes how widely the values spread around their mean, one option is the coefficient of variation (100 * SD / mean):

df %>% 
  group_by(Species) %>% 
  summarise(coef.var = 100*sd(value)/mean(value))

## A tibble: 3 x 2
#  Species    coef.var
#  <fct>         <dbl>
#1 setosa         83.8
#2 versicolor     45.8
#3 virginica      49.0
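For reference, the same quantity can be sketched in base R without reshaping. As an assumption for reproducibility, this uses three fixed rows of `iris` rather than a random stratified sample, so the numbers will differ from the output above; `rows` and `cv` are illustrative names.

```r
# One fixed row per species (rows 1, 51 and 101 of iris)
rows <- iris[c(1, 51, 101), ]

# Coefficient of variation per row: 100 * sample SD / mean of the four measurements
cv <- apply(as.matrix(rows[, 1:4]), 1, function(v) 100 * sd(v) / mean(v))
names(cv) <- as.character(rows$Species)

print(round(cv, 1))
```

Keep in mind that the four measurements mix lengths and widths, so this value mostly reflects how different those variables are from each other within one flower.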

Upvotes: 3
