Reputation: 3923
After a lot of trial and error, and after consulting previous answers such as "How to detect if bare variable or string", I think I have gotten most of what I need done myself. But I'm eager to understand whether I'm making some bad assumptions or approaching the problem foolishly before I carry my "solution" into production.
Consider the following data:
library(dplyr)
library(purrr)
library(tidyselect)
set.seed(1111)
dat1 <- data.frame(Region = rep(c("r1","r2"), each = 100),
                   State = rep(c("NY","MA","FL","GA"), each = 10),
                   Loc = rep(c("a","b","c","d","e","f","g","h"),each = 5),
                   ID = rep(c(1:10), each = 2),
                   var1 = rnorm(200),
                   var2 = rnorm(200),
                   var3 = rnorm(200),
                   var4 = rnorm(200),
                   var5 = rnorm(200))
I want to write a function that does quite a few things, but I'll start with a minimal reproducible example. I want to get tidied `aov` results back either for a single case, `var1 ~ State`, or for a pair of matched lists using `map2`, with one list containing "outcomes" and the other "predictors". They're never identical from use to use, and the variables, unlike in my example, rarely lend themselves to easy solutions like `starts_with`.
Two specific issues and a generic question.
Issue #1 - I've given up on allowing end users (including me) to pass in bare variable names; it always gets me in trouble later. In accordance with the reference above, is something like my code the fastest, most reliable way to catch them and tell the user? (I put a comment in the code to indicate where I'm talking about.)
Issue #2 - Through basically trial and error, I think I solved my other problem, which is generating some text for use later as a label. I found lots of solutions that work when I'm not using the function with `map2`, but only this one seems to work with `map2`. It seems so convoluted that I can't believe it's a good choice... (again, comments in the code show where)
Generic question: I've added the recommended `tidyselect::all_of` because these might be ambiguous lists. Why do I still have to guard against `.x` and `.y` being seen as calls, as opposed to just markers for iteration?
MyFunction <- function(data,
                       groupvar,
                       var) {
  # Issue #1 is this best way to warn/stop user?
  lst <- as.list(match.call())
  if (is.symbol(lst$groupvar) || is.symbol(lst$var)) {
    stop("Please quote all variables")
  }
  # Issue #2 I want the group label but if I don't include
  # this if logic it errors with " Error: Can't convert a call to a string"
  # when I run it with purrr::map2
  if (!is.call(groupvar)) {
    grouplabel <- rlang::as_name(rlang::enquo(groupvar))
  }
  data <-
    dplyr::select(
      .data = data,
      var = {{ var }},
      groupvar = {{ groupvar }}
    )
  aov_object <- aov(var ~ groupvar, data = data)
  aov_results <- broom::tidy(aov_object) %>%
    mutate(term = if_else(term != "Residuals", grouplabel, term))
  return(aov_results)
}
# Expected output
MyFunction(data = dat1, groupvar = "State", var = "var1") # works
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 1.75 0.582 0.485 0.693
#> 2 Residuals 196 235. 1.20 NA NA
MyFunction(data = dat1, groupvar = State, var = var1) # warns appropriately
#> Error in MyFunction(data = dat1, groupvar = State, var = var1): Please quote all variables
# Quick test of `map2`
grouping_vars <- names(dat1[,1:3])
names(grouping_vars) <- names(dat1[,1:3])
outcome_vars <- names(dat1[,5:7])
names(outcome_vars) <- names(dat1[,5:7])
names(outcome_vars) <- paste(outcome_vars, "~", grouping_vars)
# get multiple results this is where issue #2 comes in but this is what I want it to look like.
map2(.x = outcome_vars,
     .y = grouping_vars,
     .f = ~ MyFunction(dat = dat1,
                       var = tidyselect::all_of(.x),
                       groupvar = tidyselect::all_of(.y)))
#> $`var1 ~ Region`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Region 1 0.0512 0.0512 0.0427 0.836
#> 2 Residuals 198 237. 1.20 NA NA
#>
#> $`var2 ~ State`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 5.05 1.68 2.07 0.106
#> 2 Residuals 196 159. 0.814 NA NA
#>
#> $`var3 ~ Loc`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Loc 7 5.09 0.727 0.772 0.612
#> 2 Residuals 192 181. 0.943 NA NA
Upvotes: 4
Views: 317
Reputation: 174576
It seems to me that, since you are insistent on passing strings as variable names, it would be simpler and more efficient to change the formula to match the variables using `as.formula` rather than changing the data. This also saves you from having to separately name the grouping variable inside the function.
The following function is shorter and about twice as fast in benchmarking as the original, but the behaviour remains unchanged:
MyFunctionNew <- function(data, groupvar, var)
{
  lst <- as.list(match.call())
  if (is.symbol(lst$groupvar) || is.symbol(lst$var))
    stop("Please quote all variables")
  broom::tidy(aov(as.formula(paste(var, "~", groupvar)), data = data)) %>%
    mutate(term = if_else(term != "Residuals", groupvar, term))
}
You can see that it still works inside `map2`:
map2(.x = outcome_vars,
     .y = grouping_vars,
     .f = ~ MyFunctionNew(dat = dat1,
                          var = tidyselect::all_of(.x),
                          groupvar = tidyselect::all_of(.y)))
#> $`var1 ~ Region`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Region 1 0.0512 0.0512 0.0427 0.836
#> 2 Residuals 198 237. 1.20 NA NA
#>
#> $`var2 ~ State`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 5.05 1.68 2.07 0.106
#> 2 Residuals 196 159. 0.814 NA NA
#>
#> $`var3 ~ Loc`
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Loc 7 5.09 0.727 0.772 0.612
#> 2 Residuals 192 181. 0.943 NA NA
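The formula-building step in `MyFunctionNew` can be looked at in isolation. The sketch below shows the `as.formula(paste(...))` approach next to base R's `stats::reformulate()`, which is offered here only as an alternative and is not what the function above uses. The names `outcome`, `predictor`, `f_paste` and `f_reform` are illustrative.

```r
# Column names held as strings, as they would arrive from map2
outcome <- "var1"
predictor <- "State"

# Approach used in MyFunctionNew: paste the pieces and convert to a formula
f_paste <- as.formula(paste(outcome, "~", predictor))

# Base R alternative: reformulate() builds the formula from term labels
f_reform <- reformulate(predictor, response = outcome)

# Both deparse to the same formula text: var1 ~ State
deparse(f_paste)
deparse(f_reform)
```

Either formula can be passed to `aov()` together with `data = data`, so the data frame itself never needs renaming.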
In terms of screening variables to ensure they are character strings, I don't think this is idiomatic R usage, and could cause some confusion to casual users of your function. In other words, it violates the principle of least astonishment.
For example, as a naive user, I would expect to be able to specify the grouping variable programmatically like this:
MyVar <- "State"
MyFunction(data = dat1, groupvar = MyVar, var = "var1")
However, I get an error telling me that all variables should be quoted.
This also means that your function won't work within base R loops and `*apply` functions:
lapply(c("State", "Region", "ID"), function(x) MyFunction(dat1, x, "var1"))
#> Error in MyFunction(dat1, x, "var1") : Please quote all variables
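Both failures have the same cause: `match.call()` records each argument as it was typed, not its value, so a variable that holds a string is indistinguishable from a bare column name. The same mechanism explains why an argument wrapped in `all_of(.x)` arrives as a call. A minimal sketch, using a hypothetical helper `captured_type()` that is not part of the original function:

```r
# Hypothetical helper: reports how match.call() sees its argument
captured_type <- function(arg) {
  captured <- as.list(match.call())$arg
  if (is.symbol(captured)) {
    "symbol"
  } else if (is.call(captured)) {
    "call"
  } else {
    class(captured)
  }
}

MyVar <- "State"
captured_type("State")       # "character": a literal string passes the check
captured_type(MyVar)         # "symbol": rejected although MyVar holds a string
captured_type(toupper("x"))  # "call": how a wrapped argument such as all_of(.x) appears
```

The check therefore tests how the caller wrote the argument rather than what the argument contains.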
I think this is far more confusing and limiting than just allowing an error to be thrown when an unquoted column name is used. Therefore, I think your production function should be something like:
MyFunction <- function(data, groupvar, var)
{
  broom::tidy(aov(as.formula(paste(var, "~", groupvar)), data = data)) %>%
    mutate(term = if_else(term != "Residuals", groupvar, term))
}
Which performs like this:
MyFunction(data = dat1, groupvar = "State", var = "var1")
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 1.75 0.582 0.485 0.693
#> 2 Residuals 196 235. 1.20 NA NA
MyFunction(data = dat1, groupvar = MyVar, var = "var1")
#> # A tibble: 2 x 6
#> term df sumsq meansq statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 State 3 1.75 0.582 0.485 0.693
#> 2 Residuals 196 235. 1.20 NA NA
MyFunction(data = dat1, groupvar = State, var = var1)
#> Error in paste(var, "~", groupvar) : object 'State' not found
I think most R users would realise why they were getting this last error, since it is pretty clear. It is also an error that regular R users will have seen many times. If you have less faith in your users, perhaps you could try wrapping the function body in a `tryCatch` that converts an "object not found" error into a "please use quotes" error.
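A minimal sketch of that idea follows. The wrapper name `aov_by_name` and the wording of the message are hypothetical, the example uses the built-in `warpbreaks` data and returns the raw `aov` object to stay self-contained, and the pattern match assumes R is reporting errors in English.

```r
# Hypothetical wrapper: turns "object '...' not found" into a clearer hint
aov_by_name <- function(data, groupvar, var) {
  tryCatch(
    {
      # paste() forces both arguments, so an unquoted name fails in here
      fml <- as.formula(paste(var, "~", groupvar))
      aov(fml, data = data)
    },
    error = function(e) {
      # Assumption: English error messages; other locales need another pattern
      if (grepl("object '.*' not found", conditionMessage(e))) {
        stop("Column names must be passed as quoted strings, e.g. groupvar = \"wool\"",
             call. = FALSE)
      }
      stop(e)  # re-raise any other error unchanged
    }
  )
}

fit <- aov_by_name(warpbreaks, groupvar = "wool", var = "breaks")
# try() lets the script continue after the deliberate failure below
msg <- try(aov_by_name(warpbreaks, groupvar = wool, var = "breaks"), silent = TRUE)
```

A variable that holds a column name still works with this wrapper, because only a genuinely missing object triggers the hint.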
Ultimately, it may be best to write the function so that it takes naked symbols, but I get the impression you are keen to avoid that and so I won't labour the point here.
Upvotes: 4
Reputation: 21349
I have resolved issue #1. The function below works whether the variable names are quoted or not.
MyFunction <- function(data,
                       groupvar,
                       var) {
  # Issue #1 is this best way to warn/stop user?
  # expr() and eval_tidy() come from rlang, so they are namespaced here
  lst <- as.list(match.call())
  if (is.symbol(lst$groupvar)) {
    # bare name: pull the symbol out of the call and convert it to a string
    q <- paste0("groupvar")
    varname <- rlang::expr('$'(lst,!!q))
    gval <- rlang::eval_tidy(varname)
    groupvarc <- as.character(gval)
  }else{groupvarc <- rlang::eval_tidy(lst$groupvar)}
  if (is.symbol(lst$var)) {
    v <- paste0("var")
    varnam <- rlang::expr('$'(lst,!!v))
    vval <- rlang::eval_tidy(varnam)
    varc <- as.character(vval)
  }else{varc <- rlang::eval_tidy(lst$var)}
  grouplabel <- groupvarc[1]
  data <- dplyr::select(.data = data,
                        var = varc[[1]],
                        groupvar = groupvarc[[1]] )
  aov_object <- aov(var ~ groupvar, data = data)
  aov_results <- broom::tidy(aov_object) %>%
    mutate(term = if_else(term != "Residuals", grouplabel, term))
  return(aov_results)
}
MyFunction(data = dat1, groupvar = "State", var = "var1") # works
MyFunction(data = dat1, groupvar = State, var = var1) # Also works
For multiple variables, you will need to wrap this logic in a function and cycle through it with `lapply`. That would also tidy up the code I repeated twice for issue #1. I hope this helps you move forward.
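One way the repeated block might be factored out is sketched below in base R. The names `as_column_name` and `aov_flexible` are hypothetical, the example uses the built-in `warpbreaks` data, and it returns the raw `aov` object rather than a tidied table. Note the trade-off: any bare symbol is treated as a column name, so a variable that merely holds a column name would be misread.

```r
# Hypothetical helper: returns the column name as a string whether the
# caller wrote wool or "wool"; other expressions are evaluated in `env`
as_column_name <- function(expr, env) {
  if (is.symbol(expr)) as.character(expr) else eval(expr, env)
}

aov_flexible <- function(data, groupvar, var) {
  args <- as.list(match.call())
  env <- parent.frame()  # where the caller's expressions should be evaluated
  g <- as_column_name(args$groupvar, env)
  v <- as_column_name(args$var, env)
  aov(reformulate(g, response = v), data = data)
}

fit_bare   <- aov_flexible(warpbreaks, groupvar = wool, var = breaks)
fit_quoted <- aov_flexible(warpbreaks, groupvar = "wool", var = "breaks")
```

Both calls fit the same model, `breaks ~ wool`, so the coefficients of `fit_bare` and `fit_quoted` agree.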
Upvotes: 1