Mutate inside a function: object not found

Question

I have a simple dataset that looks like this:

I want to transform the price variables into numeric, which I understand requires first cleaning out the € symbol and changing the comma into a dot. I have done this manually for the price_new variable and it works as intended:

 sp_merrel_df_clean <- sp_merrel_df_raw %>%
    mutate(
      price_new = sub(" €", "", price_new),
      price_new = sub(",", ".", price_new),
      price_new = as.numeric(price_new)
    )

However, as I'm new to R, I wanted to turn this into a more general function for practice. So I used the exact same code as above, but replacing the specific inputs with generic placeholders - presumably it would work, since it's the exact same code...

  currency_to_numeric <- function(raw, clean, var) {
    clean <- raw %>%
      mutate(
        var = sub(" €", "", var),
        var = sub(",", ".", var),
        var = as.numeric(var))
  }

...but instead, when calling:

currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)

I get this error:

Error: Problem with `mutate()` column `var`.
i `var = sub(" \200", "", var)`.
x object 'price_new' not found

Apologies for what is probably a very basic question, and thank you in advance!

Greg · Accepted Answer

Below, I have included a very ergonomic Solution, which (almost) seamlessly mimics the familiar feel of the dplyr workflow. I have also spent some time in diagnosing and addressing conceptual pitfalls.

A common source of confusion is the fact that R passes by value and not by reference. Along with the nuances of programmatic dplyr, this fact is responsible for two conceptual errors in your code.

For the sake of convenience, I have reproduced your sp_merrel_df_raw here as a data.frame:

structure(list(product_name = c("Merrell Riverbed 3", "Sapatilhas Montanha Merrell", "Merrell Moab Adventure", "Merrell Moab 2 Vent"),
               price_new = c("59,99 €", "149,99 €", "99,99 €", "79,99 €"),
               price_old = c("69,99 €", NA, NA, "99,99 €"),
               date = c(210720, 210720, 210720, 210720)),
          row.names = c(NA, -4L),
          class = "data.frame")

You Cannot Modify "in Place"

Serendipitously, I clarified this exact confusion some time ago. Suppose you have a simple numeric variable a and a simple function doubler():

x <- 2

doubler <- function(num) {
  num <- 2 * num
}

Now simply running doubler(x) will do absolutely nothing aside from (invisibly) returning 4. All that happens is that the parameter num is passed the value 2, and then num is overwritten with 4 within the scope of the function. However, the original variable x remains untouched:

doubler(x)

x
# [1] 2

In order to modify x, we must overwrite it (<-) with the results of the function:

x <- doubler(x)

x
# [1] 4

Analogously, when you run your currency_to_numeric() function

currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)

it will accept the value of sp_merrel_df_clean2, and assign that value to its clean parameter. Everything that happens afterward

    clean <- raw %>% # ...

affects only clean within the scope of the function. When all is said and done, sp_merrel_df_clean2 will never be affected.

Instead, something like this is required, to overwrite sp_merrel_df_clean2 with the new value:

currency_to_numeric <- function(raw, ...) {
  # ...
}

sp_merrel_df_clean2 <- currency_to_numeric(raw = sp_merrel_df_raw, ...)

Decontextualized Variables

As discussed in the dplyr documentation

env-variables are “programming” variables that live in an environment

whereas

data-variables are “statistical” variables that live in a data frame.

Now data-variables are "masked" in the context of certain functions, especially in dplyr. Such masking lets us refer to the sp_merrel_df_raw$price_new column as simply price_new when we perform (say) a mutate() on sp_merrel_df_raw:

sp_merrel_df_raw %>%
  mutate(
    price_new = sub(" €", "", price_new)
    # ...
  )

However, when you run your currency_to_numeric() function

currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)

var does not become the sp_merrel_df_raw$price_new variable itself.

Rather, R looks for some env-variable named price_new, in surrounding environment (here .GlobalEnv), and attempts to assign its value to the var parameter. Naturally, since no such price_new variable exists in .GlobalEnv, there is no such value, so R throws an error as soon as it tries to use that value in mutate():

Error: Problem with `mutate()` column `var`.
i `var = sub(" \200", "", var)`.
x object 'price_new' not found

This error is comparable to what you would get if you called a function on any other variable that didn't exist:

doubler(num = nonexistent_variable)

# Error in doubler(num = nonexistent_variable) : 
#   object 'nonexistent_variable' not found

However, even if price_new were actually floating around in .GlobalEnv as a typical env-variable, you would still get an error. This is because passing the value of price_new to var is not the same as "pasting" the "name" price_new

mutate(
        price_new = sub(" €", "", price_new),
        price_new = sub(",", ".", price_new),
        price_new = as.numeric(price_new))

wherever the "name" var used to be.

mutate(
        var = sub(" €", "", var),
        var = sub(",", ".", var),
        var = as.numeric(var))

Solution

Here's a nifty reworking of currency_to_numeric () that closely imitates the typical functionality of dplyr:

currency_to_numeric <- function(raw, ...) {
  raw %>%
    mutate(
      across(c(...), ~ sub(" €", "", .x)),
      across(c(...), ~ sub(",", ".", .x)),
      across(c(...), ~ as.numeric(.x))
      )
}

As with virtually any R function, you must still assign the results to sp_merrel_df_clean2, but this solution will help you do so very cleanly

sp_merrel_df_clean2 <- sp_merrel_df_raw %>%
  currency_to_numeric(price_new)

with the following results for sp_merrel_df_clean2:

                 product_name price_new price_old   date
1          Merrell Riverbed 3     59.99   69,99 € 210720
2 Sapatilhas Montanha Merrell    149.99       210720
3      Merrell Moab Adventure     99.99       210720
4         Merrell Moab 2 Vent     79.99   99,99 € 210720

In fact, you can simultaneously target as many data-variables (like price_new and price_old) as you want

sp_merrel_df_clean2 <- sp_merrel_df_raw %>%
  currency_to_numeric(price_new, price_old)

and covert all your currency columns in one fell swoop!

                 product_name price_new price_old   date
1          Merrell Riverbed 3     59.99     69.99 210720
2 Sapatilhas Montanha Merrell    149.99        NA 210720
3      Merrell Moab Adventure     99.99        NA 210720
4         Merrell Moab 2 Vent     79.99     99.99 210720

Mutate inside a function: object not found

Answers (2)

You Cannot Modify "in Place"

Decontextualized Variables

Solution

Related Questions