Let dplyr mutate use formula

Question

I have a large dataset stored in a long data frame. I would like to extract the data for some variables and use a formula to generate new data from them. All the necessary information should be taken from the formula itself. First, I use the variable names in the formula to filter the dataset for the relevant variables, using the all.vars() function. I also rely on the formula.tools package (available on CRAN) for easy extraction of the left- and right-hand sides of the equation (lhs and rhs, respectively).

library(dplyr)
library(reshape2)
library(formula.tools)

set.seed(100)

the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year    = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value = rnorm(27, 100, 100))

add_variable <- function(df, equation){
  df <- filter(df, variable %in% all.vars(equation))

  df <- dcast(df, country + year ~ variable)

  df <- mutate_(df, rhs(equation))

  # code to keep only the newly generated column
  # ...

  df <- melt(df, id.vars = c("country", "year"))
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)

The newly generated column should be named GDPpC, but currently it is called GDP/Population. How can this be improved? As a final step, I would also like to filter the data so that the result contains only the newly generated data, which can then be appended to the source data frame via rbind.

Cabana · Accepted Answer

Would this be a solution?

add_variable <- function(df, equation){
  # keep only the variables named in the formula
  df <- filter(df, variable %in% all.vars(equation))
  orig_vars <- unique(df$variable)
  df <- dcast(df, country + year ~ variable)

  # compute the new column, then rename it (the last column) to the formula's lhs
  df <- mutate_(df, rhs(equation))
  colnames(df)[ncol(df)] <- as.character(lhs(equation))

  # back to long format, dropping the original variables
  df <- melt(df, id.vars = c("country", "year"))
  df <- filter(df, !variable %in% orig_vars)
}

result <- add_variable(the_data, GDPpC ~ GDP / Population)
result
  country year variable      value
1     CHN 2000    GDPpC 0.04885649
2     CHN 2010    GDPpC 2.62313658
3     CHN 2020    GDPpC 0.31685382
4     DEU 2000    GDPpC 0.80180998
5     DEU 2010    GDPpC 0.62642877
6     DEU 2020    GDPpC 0.97587188
7     USA 2000    GDPpC 0.26383912
8     USA 2010    GDPpC 1.01303516
9     USA 2020    GDPpC 0.69851501
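
As mutate_() is deprecated in newer dplyr releases, the same idea can probably also be written without it and without formula.tools. The following is only a minimal sketch, assuming a dplyr version that supports `!!` and a tidyr version that provides pivot_wider(). The function name add_variable2 is hypothetical (chosen here for illustration), and the two sides of the formula are taken with base R indexing (`equation[[2]]` and `equation[[3]]`) instead of lhs() and rhs().

```r
library(dplyr)
library(tidyr)

set.seed(100)

# same example data as in the question
the_data <- data.frame(country  = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
                       year     = c(2000, 2010, 2020),
                       variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
                       value    = rnorm(27, 100, 100))

# add_variable2 is a hypothetical name, not part of any package
add_variable2 <- function(df, equation) {
  new_name <- as.character(equation[[2]])  # left-hand side, e.g. "GDPpC"
  rhs_expr <- equation[[3]]                # right-hand side as an unevaluated call

  df %>%
    # keep only the variables used on the right-hand side
    filter(variable %in% all.vars(rhs_expr)) %>%
    # long -> wide: one column per variable
    pivot_wider(names_from = variable, values_from = value) %>%
    # evaluate the right-hand side inside the data and label the result
    mutate(value = !!rhs_expr, variable = new_name) %>%
    # return only the new rows, already in long format
    select(country, year, variable, value)
}

result <- add_variable2(the_data, GDPpC ~ GDP / Population)
```

Because the result already has the columns country, year, variable and value, it should be possible to append it to the source data with `bind_rows(the_data, result)`. One limitation of this sketch: a formula that itself refers to a variable called value or variable would clash with the column names used inside mutate().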
