arnyeinstein
arnyeinstein

Reputation: 1013

using mutate_at from dplyr

I have a data frame with 5 columns and I want to produce 4 additional columns giving my the difference between the last 4 columns and the first column. I tried the following, but that doesn't work:

library(tidyverse)
df <- as.tibble(data.frame(A = c(1,2), B = c(3,4), C = c(4,5), D = c(2,3), E = c(4,5)))
r_diff <- function(x,y){
  z = y - x
  return(z)
}
vars_to_process <- c("B","C","D","E")
df %>% mutate_at(.cols=vars_to_process, .funs =r_diff(.,df[,1])) %>% head()

Thanks Renger

Upvotes: 4

Views: 10918

Answers (3)

bschneidr
bschneidr

Reputation: 6278

Here's the simplest way to do it.

df %>% 
   mutate_at(.vars = vars(B:E),
             .funs = list(~ . - A))

The .vars argument lets you specify columns in the same way that you would specify columns in select(), provided you put that specification inside the function vars().

The .funs argument accepts an anonymous function defined on the fly inside a call to list(). And you can reference a column in the dataframe (in this case A) when defining this anonymous function (see this Stackoverflow question).

In addition, with the release of dplyr 1.0.0, you can now simply do the following:

df %>%
   mutate(across(B:E, ~ . - A))

Upvotes: 9

Spacedman
Spacedman

Reputation: 94162

Here's a faster solution using base R code. Strategy is convert to a matrix, subtract column one from the required columns, build back into a data frame. Note this only returns the modified columns - if there are columns not in vars_to_process they'll not appear in the output but you didn't have any of those in your test set so I'll assume they don't exist.

So, always write things in functions whenever possible:

bsr = function(df,vars_to_process){
    m = as.matrix(df)
    data.frame(
         A = m[, 1],
             m[, 1] - m[, vars_to_process])}

Make some test data:

> df = data.frame(matrix(runif(5*1000), ncol=5))
> names(df)=LETTERS[1:5]
> dft = as.tibble(df)
> head(dft)
# A tibble: 6 x 5
          A          B         C         D         E
      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>
1 0.2609174 0.07857624 0.2727817 0.8498004 0.3403234
2 0.3644744 0.95810657 0.8183856 0.2958133 0.4752349
3 0.6042914 0.98793218 0.7547003 0.9596591 0.5354045
4 0.4000441 0.61403331 0.9018804 0.3838347 0.3266855
5 0.6767012 0.11984219 0.9181570 0.5988404 0.6058629

Compare with the tidyverse version:

akr = function(df,vars_to_process){
   df %>% mutate_at(vars_to_process, funs(r_diff(.,df[[1]])))
   }

Check bsr and akr agree:

> head(bsr(dft, vars_to_process))
          A          B           C           D           E
1 0.2609174  0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119  0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765  0.06888696
4 0.4000441 -0.2139892 -0.50183635  0.01620939  0.07335861

> head(akr(dft, vars_to_process))
# A tibble: 6 x 5
          A          B           C           D           E
      <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1 0.2609174  0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119  0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765  0.06888696
4 0.4000441 -0.2139892 -0.50183635  0.01620939  0.07335861

okay, except akr returns a tribble but nm. Benchmark:

> microbenchmark(bsr(dft, vars_to_process),akr(dft, vars_to_process))
Unit: microseconds
                      expr      min        lq      mean   median       uq
 bsr(dft, vars_to_process)  362.117  388.7215  488.9309  446.123  521.776
 akr(dft, vars_to_process) 8070.391 8365.4230 9853.5239 8673.692 9335.613

Base R version is 26 times faster. I'd also argue that subtracting a column from another set of columns is tidier than applying a mutator function but as long as you wrap what your doing in a function it doesn't matter how messy the guts are.

Upvotes: 5

akrun
akrun

Reputation: 886928

We need to subset the column with [[ as the [ is still a data.frame

df %>% 
   mutate_at(vars_to_process, funs(r_diff(.,df[[1]]))) 
# A tibble: 2 x 5
#     A     B     C     D     E
#  <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1    -2    -3    -1    -3
#2     2    -2    -3    -1    -3

Upvotes: 4

Related Questions