waealu
waealu

Reputation: 331

dplyr - mutate formula based on similarities in column names

I am trying to find a better way to run a mutate() on a combination of columns based on parts of the column names.

For example, a way to simplify the mutate function in the following code:

df <- data.frame(LIMITED_A = c(100,200),
                UNLIMITED_A = c(25000,50000),
                LIMITED_B = c(300,300),
                UNLIMITED_B = c(500,500),
                LIMITED_C = c(2,10),
                UNLIMITED_C = c(5,20))

df %>%
  mutate(FINAL_LIMITED = (LIMITED_A - LIMITED_B) / LIMITED_C,
         FINAL_UNLIMITED = (UNLIMITED_A - UNLIMITED_B) / UNLIMITED_C)

A formula with the form: (._A - ._B) / ._C and the result is given the name FINAL_.

Is there a way to simplify this to a single line of code in the mutate function?

Upvotes: 2

Views: 589

Answers (2)

acylam
acylam

Reputation: 18681

Here is a different approach:

library(dplyr)
library(rlang)
library(glue)

dynamic_mutate = function(DF,  
                          col_names = gsub("(.*)_\\w+$", "\\1", names(DF)), 
                          expression = "({x}_A - {x}_B)/{x}_C",
                          prefix = "FINAL"){

  name_list = col_names %>% 
    unique() %>%
    as.list()

  expr_list = name_list %>%
    lapply(function(x) parse_quosure(glue(expression))) %>% 
    setNames(paste(prefix, name_list, sep = "_")) 

  DF %>% mutate(!!!expr_list)

}

Result:

> df %>%
+   dynamic_mutate()
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C FINAL_LIMITED
1       100       25000       300         500         2           5          -100
2       200       50000       300         500        10          20           -10
  FINAL_UNLIMITED
1            4900
2            2475

> df %>%
+   dynamic_mutate(c("LIMITED", "UNLIMITED"), prefix = "NEW")
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C NEW_LIMITED
1       100       25000       300         500         2           5        -100
2       200       50000       300         500        10          20         -10
  NEW_UNLIMITED
1          4900
2          2475

> df %>%
+   dynamic_mutate(c("UNLIMITED"), prefix = "NEW")
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C NEW_UNLIMITED
1       100       25000       300         500         2           5          4900
2       200       50000       300         500        10          20          2475

> df %>% 
+   dynamic_mutate(c("A", "B", "C"), "LIMITED_{x} + UNLIMITED_{x}")
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C FINAL_A FINAL_B FINAL_C
1       100       25000       300         500         2           5   25100     800       7
2       200       50000       300         500        10          20   50200     800      30

Notes:

This approach uses lapply and glue to construct expressions from prefixes extracted using gsub (or you can supply your own prefixes/suffixes). parse_quosure from rlang is then used to parse the expression into a quosure. As a result, expr_list is a named list of quosure's which I can then use !!! to unquote and splice the arguments into separate expressions in mutate.

You can change the formula by adjusting the expression argument as shown in the last example.

The advantage of this method is that it is quite fast because I am mainly manipulating column names and creating strings (expressions). The disadvantage is that it uses multiple packages.

Upvotes: 2

www
www

Reputation: 39154

One idea is to convert the data frame to long format and conduct the calculation.

library(dplyr)
library(tidyr)

df2 <- df %>%
  mutate(ID = 1:n()) %>%
  gather(Type, Value, -ID) %>%
  separate(Type, into = c("Type", "Group")) %>%
  spread(Group, Value) %>%
  mutate(Final = (A - B)/C)
df2
  ID      Type     A   B  C Final
1  1   LIMITED   100 300  2  -100
2  1 UNLIMITED 25000 500  5  4900
3  2   LIMITED   200 300 10   -10
4  2 UNLIMITED 50000 500 20  2475

And you can always convert the data frame back to the wide format.

df3 <- df2 %>%
  gather(Group, Value, A:Final) %>%
  unite(Col, Type, Group) %>%
  spread(Col, Value) %>%
  select(colnames(df), 
         FINAL_LIMITED = LIMITED_Final, 
         FINAL_UNLIMITED = UNLIMITED_Final)
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C FINAL_LIMITED FINAL_UNLIMITED
1       100       25000       300         500         2           5          -100            4900
2       200       50000       300         500        10          20           -10            2475

Upvotes: 2

Related Questions