Reputation: 1205
I have a large dataset stored in a long dataframe. I would like to extract data on some variables and use a formula to generate new data. All the necessary information should be extracted from the formula. First, I want to use the information in the formula to filter the dataset for the according variables - I use the all.vars()
function for that. I also rely on the formula.tools
package, which is on CRAN. It is used for easy extraction of the left and right hand side of the equation (lhs
and rhs
, respectively).
library(dplyr)
library(reshape2)
library(formula.tools)
set.seed(100)
the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
year = c(2000, 2010, 2020),
variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
value = rnorm(27, 100, 100))
add_variable <- function(df, equation){
df <- filter(df, variable %in% all.vars(equation))
df <- dcast(df, country + year ~ variable)
df <- mutate_(df, rhs(equation))
# code to keep only the newly generated column
# ...
df <- melt(df, id.vars = c("country", "year"))
}
result <- add_variable(the_data, GDPpC ~ GDP / Population)
The newly generated column should be named GDPpC
, currently it is called GDP/Population
. How can this be improved? In a final step I would like to also filter the data so that only the newly generated data is contained in the result, which can then be attached to the source dataframe via rbind
.
Upvotes: 2
Views: 2714
Reputation: 3923
I wound up here many years later while looking for information on using a formula inside a dplyr::mutate
, since I often find that to be much more concise and clear. dplyr
has of course grown and changed since 2016, including the fact that transmute
is now deprecated. But the good news is that if you're willing to use formula.tools
the solution is very concise. as shown below.
library(dplyr)
# reproducible play_data
set.seed(2020)
play_data <-
data.frame(
a = runif(20, 0.01, .5),
b = runif(20, 0.02, .5),
c = runif(20, 0.03, .5),
d = runif(20, 0.04, .5),
e = runif(20,1,5),
f = runif(20,10,50)
)
my_formula <- newvariable ~ a * b^c / d * log(e) - f
require(formula.tools)
mutate_by_formula <- function(df, equation){
df %>% transmute( !!lhs(equation) := !!rhs(equation) )
}
mutate_by_formula(play_data, my_formula)
#> newvariable
#> 1 -25.80405
#> 2 -20.48974
#> 3 -37.87361
#> 4 -46.52231
#> 5 -19.88420
#> 6 -16.49153
#> 7 -37.25498
#> 8 -41.02025
#> 9 -31.88338
#> 10 -42.17896
#> 11 -30.75905
#> 12 -10.42447
#> 13 -25.84538
#> 14 -46.08206
#> 15 -13.51940
#> 16 -25.30124
#> 17 -19.80536
#> 18 -26.42881
#> 19 -38.02190
#> 20 -30.51113
As to OP's original example and specifics, it's slightly more complex since the data has to be shaped but the basic notion is the same. The only twists are the dcast
and the final select
to remove variables that were used in the calculations.
library(dplyr)
library(reshape2)
library(formula.tools)
set.seed(100)
the_data <- data.frame(country = c(rep("USA", 9), rep("DEU", 9), rep("CHN", 9)),
year = c(2000, 2010, 2020),
variable = c(rep("GDP", 3), rep("Population", 3), rep("Consumption", 3)),
value = rnorm(27, 100, 100))
specific_function <- function(df, equation){
df %>%
filter(variable %in% all.vars(equation)) %>%
dcast(country + year ~ variable) %>%
mutate(!!lhs(equation) := !!rhs(equation)) %>%
select(-all.vars(equation)[2:length(all.vars(equation))])
}
specific_function(the_data, GDPpC ~ GDP / Population)
#> country year GDPpC
#> 1 CHN 2000 0.04885649
#> 2 CHN 2010 2.62313658
#> 3 CHN 2020 0.31685382
#> 4 DEU 2000 0.80180998
#> 5 DEU 2010 0.62642877
#> 6 DEU 2020 0.97587188
#> 7 USA 2000 0.26383912
#> 8 USA 2010 1.01303516
#> 9 USA 2020 0.69851501
Created on 2020-05-04 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 419
Would that be a solution ?
add_variable <- function(df, equation){
df <- filter(df, variable %in% all.vars(equation))
orig_vars <- unique(df$variable)
df <- dcast(df, country + year ~ variable)
df <- mutate_(df, rhs(equation))
colnames(df)[ncol(df)] <- as.character(lhs(equation))
df <- melt(df, id.vars = c("country", "year"))
df <- filter(df, !variable%in%orig_vars)
}
result <- add_variable(the_data, GDPpC ~ GDP / Population)
result
country year variable value
1 CHN 2000 GDPpC 0.04885649
2 CHN 2010 GDPpC 2.62313658
3 CHN 2020 GDPpC 0.31685382
4 DEU 2000 GDPpC 0.80180998
5 DEU 2010 GDPpC 0.62642877
6 DEU 2020 GDPpC 0.97587188
7 USA 2000 GDPpC 0.26383912
8 USA 2010 GDPpC 1.01303516
9 USA 2020 GDPpC 0.69851501
Upvotes: 3