Reputation: 1847
I would like to standardize variables in R. I know about multiple approahces how this can be done. However, I realy like using this approach bellow:
library(tidyverse)
df <- mtcars
df %>%
gather() %>%
group_by(key) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = key, values_from = value)
For some reason this approach does not work since I am not able to return the data to the original format. Therefore, I would like to ask for advice
Upvotes: 8
Views: 9837
Reputation: 589
df %>%
gather() %>%
group_by(key) %>%
mutate(row = row_number(), value = value - mean(value)) %>%
pivot_wider(names_from = key, values_from = value) %>%
select(-row)
Upvotes: 0
Reputation: 18595
According to the current documentation, you should be using across
-based syntax to perform operations on a desired subset of columns. You can use everything
to select all columns or use any other available qualifier. You should only use group_by
verb if your desire is to perform operation on groups. group_by
is not right choice for selecting variables.
mtcars %>%
as_tibble() %>%
mutate(across(where(is.numeric), ~ . - mean(.)))
As for the actual standardisation or any other operation you want to apply to the subset of columns you can use:
.fns Functions to apply to each of the selected columns. Possible values are:
NULL
, to returns the columns untransformed.- A function, e.g.
mean
.- A purrr-style lambda, e.g.
~ mean(.x, na.rm = TRUE)
- A list of functions/lambdas, e.g.
list(mean = mean, n_miss = ~ sum(is.na(.x))
So for scale
you can do:
mtcars %>%
as_tibble() %>%
mutate(across(where(is.numeric), scale))
or with additional arguments
mtcars %>%
as_tibble() %>%
mutate(across(where(is.numeric), scale, center = FALSE))
As you can see from ?scale
documentation, the function returns matrix. In case of the examples above, you will get matrix with one column if this bothers you, you can do:
mtcars %>%
as_tibble() %>%
mutate(across(where(is.numeric), ~ scale(.)[,1]))
>> mtcars %>%
... as_tibble() %>%
... mutate(across(where(is.numeric), ~ scale(.)[,1])) %>%
... glimpse()
Rows: 32
Columns: 11
$ mpg <dbl> 0.15088482, 0.15088482, 0.44954345, 0.21725341, -0.23073453, -0.33028740, -0.96078…
$ cyl <dbl> -0.1049878, -0.1049878, -1.2248578, -0.1049878, 1.0148821, -0.1049878, 1.0148821, …
$ disp <dbl> -0.57061982, -0.57061982, -0.99018209, 0.22009369, 1.04308123, -0.04616698, 1.0430…
$ hp <dbl> -0.53509284, -0.53509284, -0.78304046, -0.53509284, 0.41294217,
...
>>
>>
>> mtcars %>%
... as_tibble() %>%
... mutate(across(where(is.numeric), scale)) %>%
... glimpse()
Rows: 32
Columns: 11
$ mpg <dbl[,1]> <matrix[32 x 1]>
$ cyl <dbl[,1]> <matrix[32 x 1]>
$ disp <dbl[,1]> <matrix[32 x 1]>
$ hp <dbl[,1]> <matrix[32 x 1]>
...
Upvotes: 9
Reputation: 488
The warning message is heplful:
Values are not uniquely identified; output will contain list-cols.
You need a column that uniquely identifies each row:
df %>%
gather() %>%
group_by(key) %>%
mutate(row = row_number(), value = value - mean(value)) %>%
pivot_wider(names_from = key, values_from = value) %>%
select(-row)
You'll probably want to add back rownames to the result.
Upvotes: 0
Reputation: 9267
It is not clear why you first make data in long format and then return back to wide, neither why you don't prefer scale(df)
which is computationally much faster.
Anyway, if you really want to use a code similar to the one you like, you need to perform a further unnest
action in order to return the data to the original format.
df %>%
gather() %>%
group_by(key) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = key, values_from = value) %>%
unnest(everything())
# A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.909 -0.188 -70.7 -36.7 0.303 -0.597 -1.39 -0.438 0.594 0.312 1.19
# 2 0.909 -0.188 -70.7 -36.7 0.303 -0.342 -0.829 -0.438 0.594 0.312 1.19
# 3 2.71 -2.19 -123. -53.7 0.253 -0.897 0.761 0.562 0.594 0.312 -1.81
# 4 1.31 -0.188 27.3 -36.7 -0.517 -0.00225 1.59 0.562 -0.406 -0.688 -1.81
# 5 -1.39 1.81 129. 28.3 -0.447 0.223 -0.829 -0.438 -0.406 -0.688 -0.812
# 6 -1.99 -0.188 -5.72 -41.7 -0.837 0.243 2.37 0.562 -0.406 -0.688 -1.81
# 7 -5.79 1.81 129. 98.3 -0.387 0.353 -2.01 -0.438 -0.406 -0.688 1.19
# 8 4.31 -2.19 -84.0 -84.7 0.0934 -0.0272 2.15 0.562 -0.406 0.312 -0.812
# 9 2.71 -2.19 -89.9 -51.7 0.323 -0.0673 5.05 0.562 -0.406 0.312 -0.812
# 10 -0.891 -0.188 -63.1 -23.7 0.323 0.223 0.451 0.562 -0.406 0.312 1.19
# ... with 22 more rows
In order to use the most up-to-date tidyr
functions, you should consider substituting gather
, which is retired now, with pivot_longer
as in the following code. The obtained result is the same.
df %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value) %>%
unnest(everything())
Upvotes: 1