user9831776

How can I split a data frame, apply some function, and put it back together?

Often I have a data frame that contains a numerical variable and a categorical variable, and I want to split the numerical variable by the categorical variable, perform some operation on each piece, and put the results back together as a data frame. The operation depends on all the values of the numerical variable within a category, and it will sometimes return a vector of a different length. I know how to do this in an ugly way (see the example below), but it seems like a common operation, so I'm wondering if there's a simpler way I'm not aware of. I would be especially interested in a solution using the tidyverse.

Here is an example of what I'm talking about.

df = data.frame(y=1:10, g=rep(c("a", "b"), each=5))

Say I want to standardize the variable y to be between 0 and 1 for each level of the categorical variable. Here is a general way to do it:

do.call(
    rbind,
    lapply(unique(df$g),
           function(level) {
               y.current = df$y[df$g==level]

               ## perform some operation
               y.new = (y.current-min(y.current))/
                   (max(y.current)-min(y.current))

               return(data.frame(y=y.new,
                                 g=level))
           }
           )
)

This requires a lot of typing and isn't very readable. Is there a better way?

Edit: Thanks for the great answers. The only thing I'm still interested in is a fully general method of doing this with the tidyverse. If we change the example to an operation that returns fewer values than the group contains, but more than one, the group_by/mutate and group_by/summarize combinations don't work, since mutate expects one value per row and summarize one value per group. For example, say I want to remove the largest value within every group. I can do

library(dplyr)
df = data.frame(y=1:10, g=rep(c("a", "b"), each=5))
trans_df = df %>%
    group_by(g) %>%
    do(y=.$y[-which.max(.$y)])

The transformed data frame trans_df has one row per level of the grouping variable, and the transformed variable is stored as a list column with one vector per level. I can put this back in the original format using base R with

data.frame(g=rep(trans_df$g, times=sapply(trans_df$y, length)),
           y=do.call(c, trans_df$y))

but how can I do it using the tidyverse?

Upvotes: 2

Views: 591

Answers (3)

Calum You

Reputation: 15072

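This is the classic split-apply-combine pattern: you group by a categorical variable, apply a function to each group, and combine the results. In dplyr, group_by defines the groups, and the verbs that follow (mutate, top_n, and so on) then operate within each group. The first example below rescales y within each group; the second drops the largest value in each group by keeping the four smallest of its five values.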

df <- data.frame(y=1:10, g=rep(c("a", "b"), each=5))

library(dplyr)
df %>%
  group_by(g) %>% 
  mutate(y2 = (y - min(y)) / (max(y) - min(y)))
#> # A tibble: 10 x 3
#> # Groups:   g [2]
#>        y g        y2
#>    <int> <fct> <dbl>
#>  1     1 a      0   
#>  2     2 a      0.25
#>  3     3 a      0.5 
#>  4     4 a      0.75
#>  5     5 a      1   
#>  6     6 b      0   
#>  7     7 b      0.25
#>  8     8 b      0.5 
#>  9     9 b      0.75
#> 10    10 b      1

df %>%
  group_by(g) %>% 
  top_n(-4, y)
#> # A tibble: 8 x 2
#> # Groups:   g [2]
#>       y g    
#>   <int> <fct>
#> 1     1 a    
#> 2     2 a    
#> 3     3 a    
#> 4     4 a    
#> 5     6 b    
#> 6     7 b    
#> 7     8 b    
#> 8     9 b

Created on 2018-07-03 by the reprex package (v0.2.0).
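
The top_n(-4, y) call above works because every group has exactly five rows. For groups of arbitrary size, something along these lines may be more general. This is only a sketch: it assumes dplyr 1.1.0 or later for reframe(), and the names dropped and dropped_rows are made up for illustration.

```r
library(dplyr)

df <- data.frame(y = 1:10, g = rep(c("a", "b"), each = 5))

# reframe() may return any number of rows per group
# (assumption: dplyr >= 1.1.0 is installed)
dropped <- df %>%
  group_by(g) %>%
  reframe(y = y[-which.max(y)])

# slice() with a negative index drops the row that holds the group
# maximum and keeps every other column of the data frame
dropped_rows <- df %>%
  group_by(g) %>%
  slice(-which.max(y)) %>%
  ungroup()
```

Both results have the original long format (one row per remaining value), so no list column has to be unpacked afterwards.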

Upvotes: 0

moodymudskipper

Reputation: 47350

In base R you could do this:

df$y <- ave(df$y,df$g, FUN = function(y) (y - min(y))/(max(y) - min(y)))
#       y g
# 1  0.00 a
# 2  0.25 a
# 3  0.50 a
# 4  0.75 a
# 5  1.00 a
# 6  0.00 b
# 7  0.25 b
# 8  0.50 b
# 9  0.75 b
# 10 1.00 b

Or this for the same effect:

split(df$y,df$g) <- tapply(df$y, df$g, function(y) (y - min(y))/(max(y) - min(y)))

A bit more flexible if you need to work in other variables of the data.frame:

by_ <- by(df, df$g, function(x) transform(x, y = (y - min(y))/(max(y) - min(y))))
do.call(rbind, by_)
#         y g
# a.1  0.00 a
# a.2  0.25 a
# a.3  0.50 a
# a.4  0.75 a
# a.5  1.00 a
# b.6  0.00 b
# b.7  0.25 b
# b.8  0.50 b
# b.9  0.75 b
# b.10 1.00 b
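
The calls above all return one value per input row. When the function changes the length of each group (for example, dropping the largest value), one possible base R sketch is split/lapply followed by stack(). The names pieces and res are illustrative only.

```r
df <- data.frame(y = 1:10, g = rep(c("a", "b"), each = 5))

# split y by group, then apply a function that may change the length
pieces <- lapply(split(df$y, df$g), function(y) y[-which.max(y)])

# stack() turns the named list into a data frame with the columns
# "values" and "ind"; rename them to match the original
res <- stack(pieces)
names(res) <- c("y", "g")
```

Note that stack() only carries the values and the group label, so any other columns of the data frame are lost; the by()/do.call(rbind, ...) form above is the one to adapt if they are needed.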

Upvotes: 0

Using data.table:

library(data.table)
df = as.data.table(df)
df[, (y - min(y)) / (max(y) - min(y)), by = g]
#     g   V1
#  1: a 0.00
#  2: a 0.25
#  3: a 0.50
#  4: a 0.75
#  5: a 1.00
#  6: b 0.00
#  7: b 0.25
#  8: b 0.50
#  9: b 0.75
# 10: b 1.00
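
The result above gets the default column name V1 and drops the original y. As a sketch of two variations (the names dt, y_scaled and dropped are made up here): := adds the rescaled column in place, and a j expression wrapped in .() can return fewer rows than the group holds, which covers the "remove the largest value" case from the question.

```r
library(data.table)

dt <- data.table(y = 1:10, g = rep(c("a", "b"), each = 5))

# := adds the rescaled column by reference, keeping y and g
dt[, y_scaled := (y - min(y)) / (max(y) - min(y)), by = g]

# .() names the result column; the per-group result may be
# shorter than the group itself
dropped <- dt[, .(y = y[-which.max(y)]), by = g]
```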

Upvotes: 2
