I often have a data frame that contains a numerical variable and a categorical variable, and I want to split the numerical variable by the categorical variable, perform some operation on each piece, and put the results back together as a data frame. The operation depends on all of the numerical values within a category, and it will sometimes return a vector of a different length. I know how to do this in an ugly way (see the example below), but it seems like a common operation, so I'm wondering if there's a simpler way I'm not aware of. I would be especially interested in a solution using the tidyverse.
Here is an example of what I'm talking about.
df = data.frame(y=1:10, g=rep(c("a", "b"), each=5))
Say I want to standardize the variable y to be between 0 and 1 for each level of the categorical variable. Here is a general way to do it:
do.call(
  rbind,
  lapply(unique(df$g),
         function(level) {
           y.current = df$y[df$g == level]
           ## perform some operation
           y.new = (y.current - min(y.current)) /
             (max(y.current) - min(y.current))
           return(data.frame(y = y.new,
                             g = level))
         }
  )
)
This requires a lot of typing and isn't very readable. Is there a better way?
Edit: Thanks for the great answers. The only thing I'm still interested in is a fully general method of doing this with the tidyverse. If we change the example to an operation where the numerical vector gets shorter but still keeps more than one element per group, the group_by/mutate/summarize combinations don't work. For example, say I want to remove the largest value within every group. I can do
library(dplyr)
df = data.frame(y=1:10, g=rep(c("a", "b"), each=5))
trans_df = df %>%
  group_by(g) %>%
  do(y=.$y[-which.max(.$y)])
The transformed data frame trans_df has one row per level of the grouping variable, with the transformed variable stored as a list column. I can put this back in the original format using base R with
data.frame(g=rep(trans_df$g, times=sapply(trans_df$y, length)),
           y=do.call(c, trans_df$y))
but how can I do it using the tidyverse?
Upvotes: 2
This is the classic split-apply-combine pattern: you group by a categorical variable, apply some function to each group, and combine the results back together. In dplyr this is handled by group_by.
df <- data.frame(y=1:10, g=rep(c("a", "b"), each=5))
library(dplyr)
df %>%
  group_by(g) %>%
  mutate(y2 = (y - min(y)) / (max(y) - min(y)))
#> # A tibble: 10 x 3
#> # Groups:   g [2]
#>        y g        y2
#>    <int> <fct> <dbl>
#>  1     1 a      0
#>  2     2 a      0.25
#>  3     3 a      0.5
#>  4     4 a      0.75
#>  5     5 a      1
#>  6     6 b      0
#>  7     7 b      0.25
#>  8     8 b      0.5
#>  9     9 b      0.75
#> 10    10 b      1
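# Remove the largest value within every group.
# top_n(-4, y) keeps the 4 smallest y values per group, so it drops the
# maximum only because each group here has exactly 5 rows.
df %>%
  group_by(g) %>%
  top_n(-4, y)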
#> # A tibble: 8 x 2
#> # Groups:   g [2]
#>       y g
#>   <int> <fct>
#> 1     1 a
#> 2     2 a
#> 3     3 a
#> 4     4 a
#> 5     6 b
#> 6     7 b
#> 7     8 b
#> 8     9 b
Created on 2018-07-03 by the reprex package (v0.2.0).
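For the follow-up in the question's edit (an operation that returns fewer rows per group, but more than one), here is a minimal sketch of two tidyverse options. The helper name drop_max is hypothetical and only there for illustration. The first option uses do() with an unnamed data frame result, which dplyr row-binds across groups. The second assumes dplyr 1.1.0 or later, where reframe() accepts results of any length per group.

```r
library(dplyr)

df <- data.frame(y = 1:10, g = rep(c("a", "b"), each = 5))

# Hypothetical helper: any function mapping a vector to a vector of any length.
drop_max <- function(v) v[-which.max(v)]

# Option 1: unnamed do(). Returning a data frame per group makes dplyr
# row-bind the pieces and keep the grouping column.
res_do <- df %>%
  group_by(g) %>%
  do(data.frame(y = drop_max(.$y))) %>%
  ungroup()

# Option 2 (assumes dplyr >= 1.1.0): reframe() allows any number of
# rows per group and returns an ungrouped tibble.
res_reframe <- df %>%
  group_by(g) %>%
  reframe(y = drop_max(y))
```

Both results should have columns g and y with four rows per group, so no separate step is needed to unpack a list column.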
Upvotes: 0
In base R you could do this:
df$y <- ave(df$y, df$g, FUN = function(y) (y - min(y)) / (max(y) - min(y)))
#       y g
# 1  0.00 a
# 2  0.25 a
# 3  0.50 a
# 4  0.75 a
# 5  1.00 a
# 6  0.00 b
# 7  0.25 b
# 8  0.50 b
# 9  0.75 b
# 10 1.00 b
Or this for the same effect:
split(df$y,df$g) <- tapply(df$y, df$g, function(y) (y - min(y))/(max(y) - min(y)))
This is a bit more flexible if you also need to work with other variables of the data.frame:
by_ <- by(df, df$g, function(x) transform(x, y = (y - min(y)) / (max(y) - min(y))))
do.call(rbind, by_)
#         y g
# a.1  0.00 a
# a.2  0.25 a
# a.3  0.50 a
# a.4  0.75 a
# a.5  1.00 a
# b.6  0.00 b
# b.7  0.25 b
# b.8  0.50 b
# b.9  0.75 b
# b.10 1.00 b
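ave() and the split()<- assignment both need a result of the same length as the input, so they don't cover the question's second case, where the operation shortens each group. A minimal base R sketch for that case, assuming only y and g are needed in the result: split the vector, apply the function, and let stack() rebuild a two-column data frame from the named list.

```r
df <- data.frame(y = 1:10, g = rep(c("a", "b"), each = 5))

# Split y by g and drop the largest value in each piece.
pieces <- lapply(split(df$y, df$g), function(v) v[-which.max(v)])

# stack() turns a named list of vectors into a data frame with
# columns "values" and "ind"; rename them to match the original.
res <- stack(pieces)
names(res) <- c("y", "g")
```

Because stack() works on pieces of any length, the same three lines apply whether the function keeps, shrinks, or grows each group.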
Upvotes: 0
Using data.table:
library(data.table)
df = as.data.table(df)
df[, (y - min(y)) / (max(y) - min(y)), by = g]
#     g   V1
#  1: a 0.00
#  2: a 0.25
#  3: a 0.50
#  4: a 0.75
#  5: a 1.00
#  6: b 0.00
#  7: b 0.25
#  8: b 0.50
#  9: b 0.75
# 10: b 1.00
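The call above returns a new table with an automatically named V1 column. As a hedged sketch of two variations: := adds the standardized values as a named column by reference, and wrapping the expression in .() names the result and also handles the question's second case, where each group gets shorter. The names dt, y_std and trimmed are illustrative only.

```r
library(data.table)

dt <- data.table(y = 1:10, g = rep(c("a", "b"), each = 5))

# Add the per-group standardized values as a new column, by reference.
dt[, y_std := (y - min(y)) / (max(y) - min(y)), by = g]

# A j expression may return any number of rows per group:
# here the largest y in each group is dropped.
trimmed <- dt[, .(y = y[-which.max(y)]), by = g]
```

Here trimmed should have four rows per group, while dt keeps all ten rows plus the new y_std column.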
Upvotes: 2