Petr
Petr

Reputation: 1847

Standardize variables using dplyr [r]

I would like to standardize variables in R. I know about multiple approahces how this can be done. However, I realy like using this approach bellow:

library(tidyverse)

df <- mtcars

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = key, values_from = value)

For some reason this approach does not work since I am not able to return the data to the original format. Therefore, I would like to ask for advice

Upvotes: 8

Views: 9837

Answers (4)

taylor.2317
taylor.2317

Reputation: 589

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(row = row_number(), value = value - mean(value)) %>% 
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)

Upvotes: 0

Konrad
Konrad

Reputation: 18595

According to the current documentation, you should be using across-based syntax to perform operations on a desired subset of columns. You can use everything to select all columns or use any other available qualifier. You should only use group_by verb if your desire is to perform operation on groups. group_by is not right choice for selecting variables.

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric), ~ . - mean(.)))

As for the actual standardisation or any other operation you want to apply to the subset of columns you can use:

.fns Functions to apply to each of the selected columns. Possible values are:

  • NULL, to returns the columns untransformed.
  • A function, e.g. mean.
  • A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
  • A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))

So for scale you can do:

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric), scale))

or with additional arguments

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric), scale, center = FALSE))

Side notes

As you can see from ?scale documentation, the function returns matrix. In case of the examples above, you will get matrix with one column if this bothers you, you can do:

mtcars %>%
    as_tibble() %>%
    mutate(across(where(is.numeric),  ~ scale(.)[,1]))

Comparison

>> mtcars %>%
...     as_tibble() %>%
...     mutate(across(where(is.numeric),  ~ scale(.)[,1])) %>% 
...     glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl> 0.15088482, 0.15088482, 0.44954345, 0.21725341, -0.23073453, -0.33028740, -0.96078…
$ cyl  <dbl> -0.1049878, -0.1049878, -1.2248578, -0.1049878, 1.0148821, -0.1049878, 1.0148821, …
$ disp <dbl> -0.57061982, -0.57061982, -0.99018209, 0.22009369, 1.04308123, -0.04616698, 1.0430…
$ hp   <dbl> -0.53509284, -0.53509284, -0.78304046, -0.53509284, 0.41294217, 
...
>> 
>> 
>> mtcars %>%
...     as_tibble() %>%
...     mutate(across(where(is.numeric), scale)) %>% 
...     glimpse()
Rows: 32
Columns: 11
$ mpg  <dbl[,1]> <matrix[32 x 1]>
$ cyl  <dbl[,1]> <matrix[32 x 1]>
$ disp <dbl[,1]> <matrix[32 x 1]>
$ hp   <dbl[,1]> <matrix[32 x 1]>
...

Upvotes: 9

Lukasz
Lukasz

Reputation: 488

The warning message is heplful:

Values are not uniquely identified; output will contain list-cols.

You need a column that uniquely identifies each row:

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(row = row_number(), value = value - mean(value)) %>% 
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)

You'll probably want to add back rownames to the result.

Upvotes: 0

Ric S
Ric S

Reputation: 9267

It is not clear why you first make data in long format and then return back to wide, neither why you don't prefer scale(df) which is computationally much faster.

Anyway, if you really want to use a code similar to the one you like, you need to perform a further unnest action in order to return the data to the original format.

df %>% 
  gather() %>% 
  group_by(key) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = key, values_from = value) %>% 
  unnest(everything())

# A tibble: 32 x 11
#       mpg    cyl    disp    hp    drat       wt   qsec     vs     am   gear   carb
#     <dbl>  <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1  0.909 -0.188  -70.7  -36.7  0.303  -0.597   -1.39  -0.438  0.594  0.312  1.19 
#  2  0.909 -0.188  -70.7  -36.7  0.303  -0.342   -0.829 -0.438  0.594  0.312  1.19 
#  3  2.71  -2.19  -123.   -53.7  0.253  -0.897    0.761  0.562  0.594  0.312 -1.81 
#  4  1.31  -0.188   27.3  -36.7 -0.517  -0.00225  1.59   0.562 -0.406 -0.688 -1.81 
#  5 -1.39   1.81   129.    28.3 -0.447   0.223   -0.829 -0.438 -0.406 -0.688 -0.812
#  6 -1.99  -0.188   -5.72 -41.7 -0.837   0.243    2.37   0.562 -0.406 -0.688 -1.81 
#  7 -5.79   1.81   129.    98.3 -0.387   0.353   -2.01  -0.438 -0.406 -0.688  1.19 
#  8  4.31  -2.19   -84.0  -84.7  0.0934 -0.0272   2.15   0.562 -0.406  0.312 -0.812
#  9  2.71  -2.19   -89.9  -51.7  0.323  -0.0673   5.05   0.562 -0.406  0.312 -0.812
# 10 -0.891 -0.188  -63.1  -23.7  0.323   0.223    0.451  0.562 -0.406  0.312  1.19 
# ... with 22 more rows

EDIT

In order to use the most up-to-date tidyr functions, you should consider substituting gather, which is retired now, with pivot_longer as in the following code. The obtained result is the same.

df %>% 
  pivot_longer(everything()) %>% 
  group_by(name) %>% 
  mutate(value = value - mean(value)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = name, values_from = value) %>% 
  unnest(everything())

Upvotes: 1

Related Questions