roming
roming

Reputation: 1205

Let dplyr mutate use formula

I have a large dataset stored in a long dataframe. I would like to extract data on some variables and use a formula to generate new data. All the necessary information should be extracted from the formula. First, I want to use the information in the formula to filter the dataset for the according variables - I use the all.vars() function for that. I also rely on the formula.tools package, which is on CRAN. It is used for easy extraction of the left and right hand side of the equation (lhsand rhs, respectively).

library(dplyr)
library(reshape2)
library(formula.tools)

set.seed(100)

the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year    = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))

add_variable <- function(df, equation){
  df <- filter(df, variable %in% all.vars(equation))

  df <- dcast(df, country + year ~ variable)

  df <- mutate_(df, rhs(equation))

  # code to keep only the newly generated column
  # ...

  df <- melt(df, id.vars = c("country", "year"))
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)

The newly generated column should be named GDPpC, currently it is called GDP/Population. How can this be improved? In a final step I would like to also filter the data so that only the newly generated data is contained in the result, which can then be attached to the source dataframe via rbind.

Upvotes: 2

Views: 2714

Answers (2)

Chuck P
Chuck P

Reputation: 3923

I wound up here many years later while looking for information on using a formula inside a dplyr::mutate, since I often find that to be much more concise and clear. dplyr has of course grown and changed since 2016, including the fact that transmute is now deprecated. But the good news is that if you're willing to use formula.tools the solution is very concise. as shown below.

library(dplyr)

# reproducible play_data

set.seed(2020)
play_data <- 
  data.frame(
    a = runif(20, 0.01, .5),
    b = runif(20, 0.02, .5),
    c = runif(20, 0.03, .5),
    d = runif(20, 0.04, .5),
    e = runif(20,1,5),
    f = runif(20,10,50)
  )

my_formula <- newvariable ~ a * b^c / d * log(e) - f

require(formula.tools)

mutate_by_formula <- function(df, equation){
  df %>% transmute( !!lhs(equation) := !!rhs(equation) )
}

mutate_by_formula(play_data, my_formula)
#>    newvariable
#> 1    -25.80405
#> 2    -20.48974
#> 3    -37.87361
#> 4    -46.52231
#> 5    -19.88420
#> 6    -16.49153
#> 7    -37.25498
#> 8    -41.02025
#> 9    -31.88338
#> 10   -42.17896
#> 11   -30.75905
#> 12   -10.42447
#> 13   -25.84538
#> 14   -46.08206
#> 15   -13.51940
#> 16   -25.30124
#> 17   -19.80536
#> 18   -26.42881
#> 19   -38.02190
#> 20   -30.51113

As to OP's original example and specifics, it's slightly more complex since the data has to be shaped but the basic notion is the same. The only twists are the dcast and the final select to remove variables that were used in the calculations.

library(dplyr)
library(reshape2)
library(formula.tools)

set.seed(100)

the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year    = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))

specific_function <- function(df, equation){
  df %>% 
    filter(variable %in% all.vars(equation)) %>%
    dcast(country + year ~ variable) %>%
    mutate(!!lhs(equation) := !!rhs(equation)) %>%
    select(-all.vars(equation)[2:length(all.vars(equation))])
}

specific_function(the_data, GDPpC ~ GDP / Population)
#>   country year      GDPpC
#> 1     CHN 2000 0.04885649
#> 2     CHN 2010 2.62313658
#> 3     CHN 2020 0.31685382
#> 4     DEU 2000 0.80180998
#> 5     DEU 2010 0.62642877
#> 6     DEU 2020 0.97587188
#> 7     USA 2000 0.26383912
#> 8     USA 2010 1.01303516
#> 9     USA 2020 0.69851501

Created on 2020-05-04 by the reprex package (v0.3.0)

Upvotes: 0

Cabana
Cabana

Reputation: 419

Would that be a solution ?

add_variable <- function(df, equation){
      df <- filter(df, variable %in% all.vars(equation))
      orig_vars <- unique(df$variable)
      df <- dcast(df, country + year ~ variable)

      df <- mutate_(df, rhs(equation))
      colnames(df)[ncol(df)] <- as.character(lhs(equation))

      df <- melt(df, id.vars = c("country", "year"))
      df <- filter(df, !variable%in%orig_vars)
    }

    result <- add_variable(the_data, GDPpC ~ GDP / Population)
    result
  country year variable      value
1     CHN 2000    GDPpC 0.04885649
2     CHN 2010    GDPpC 2.62313658
3     CHN 2020    GDPpC 0.31685382
4     DEU 2000    GDPpC 0.80180998
5     DEU 2010    GDPpC 0.62642877
6     DEU 2020    GDPpC 0.97587188
7     USA 2000    GDPpC 0.26383912
8     USA 2010    GDPpC 1.01303516
9     USA 2020    GDPpC 0.69851501

Upvotes: 3

Related Questions