Reputation: 35
I have a simple dataset that looks like this:
I want to transform the price variables into numeric, which I understand requires first cleaning out the € symbol and changing the comma into a dot. I have done this manually for the price_new variable and it works as intended:
sp_merrel_df_clean <- sp_merrel_df_raw %>%
mutate(
price_new = sub(" €", "", price_new),
price_new = sub(",", ".", price_new),
price_new = as.numeric(price_new)
)
However, as I'm new to R, I wanted to turn this into a more general function for practice. So I used the exact same code as above, but replacing the specific inputs with generic placeholders - presumably it would work, since it's the exact same code...
currency_to_numeric <- function(raw, clean, var) {
clean <- raw %>%
mutate(
var = sub(" €", "", var),
var = sub(",", ".", var),
var = as.numeric(var))
}
...but instead, when calling:
currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)
I get this error:
Error: Problem with `mutate()` column `var`.
i `var = sub(" \200", "", var)`.
x object 'price_new' not found
Apologies for what is probably a very basic question, and thank you in advance!
Upvotes: 1
Views: 5031
Reputation: 3326
Below, I have included a very ergonomic Solution, which (almost) seamlessly mimics the familiar feel of the dplyr
workflow. I have also spent some time in diagnosing and addressing conceptual pitfalls.
A common source of confusion is the fact that R passes by value and not by reference. Along with the nuances of programmatic dplyr
, this fact is responsible for two conceptual errors in your code.
For the sake of convenience, I have reproduced your sp_merrel_df_raw
here as a data.frame
:
structure(list(product_name = c("Merrell Riverbed 3", "Sapatilhas Montanha Merrell", "Merrell Moab Adventure", "Merrell Moab 2 Vent"),
price_new = c("59,99 €", "149,99 €", "99,99 €", "79,99 €"),
price_old = c("69,99 €", NA, NA, "99,99 €"),
date = c(210720, 210720, 210720, 210720)),
row.names = c(NA, -4L),
class = "data.frame")
Serendipitously, I clarified this exact confusion some time ago. Suppose you have a simple numeric
variable a
and a simple function doubler()
:
x <- 2
doubler <- function(num) {
num <- 2 * num
}
Now simply running doubler(x)
will do absolutely nothing aside from (invisibly) returning 4
. All that happens is that the parameter num
is passed the value 2
, and then num
is overwritten with 4
within the scope of the function. However, the original variable x
remains untouched:
doubler(x)
x
# [1] 2
In order to modify x
, we must overwrite it (<-
) with the results of the function:
x <- doubler(x)
x
# [1] 4
Analogously, when you run your currency_to_numeric()
function
currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)
it will accept the value of sp_merrel_df_clean2
, and assign that value to its clean
parameter. Everything that happens afterward
clean <- raw %>% # ...
affects only clean
within the scope of the function. When all is said and done, sp_merrel_df_clean2
will never be affected.
Instead, something like this is required, to overwrite sp_merrel_df_clean2
with the new value:
currency_to_numeric <- function(raw, ...) {
# ...
}
sp_merrel_df_clean2 <- currency_to_numeric(raw = sp_merrel_df_raw, ...)
As discussed in the dplyr
documentation
env-variables are “programming” variables that live in an environment
whereas
data-variables are “statistical” variables that live in a data frame.
Now data-variables are "masked" in the context of certain functions, especially in dplyr
. Such masking lets us refer to the sp_merrel_df_raw$price_new
column as simply price_new
when we perform (say) a mutate()
on sp_merrel_df_raw
:
sp_merrel_df_raw %>%
mutate(
price_new = sub(" €", "", price_new)
# ...
)
However, when you run your currency_to_numeric()
function
currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)
var
does not become the sp_merrel_df_raw$price_new
variable itself.
Rather, R looks for some env-variable named price_new
, in surrounding environment (here .GlobalEnv
), and attempts to assign its value to the var
parameter. Naturally, since no such price_new
variable exists in .GlobalEnv
, there is no such value, so R throws an error as soon as it tries to use that value in mutate()
:
Error: Problem with `mutate()` column `var`.
i `var = sub(" \200", "", var)`.
x object 'price_new' not found
This error is comparable to what you would get if you called a function on any other variable that didn't exist:
doubler(num = nonexistent_variable)
# Error in doubler(num = nonexistent_variable) :
# object 'nonexistent_variable' not found
However, even if price_new
were actually floating around in .GlobalEnv
as a typical env-variable, you would still get an error. This is because passing the value of price_new
to var
is not the same as "pasting" the "name" price_new
mutate(
price_new = sub(" €", "", price_new),
price_new = sub(",", ".", price_new),
price_new = as.numeric(price_new))
wherever the "name" var
used to be.
mutate(
var = sub(" €", "", var),
var = sub(",", ".", var),
var = as.numeric(var))
Here's a nifty reworking of currency_to_numeric ()
that closely imitates the typical functionality of dplyr
:
currency_to_numeric <- function(raw, ...) {
raw %>%
mutate(
across(c(...), ~ sub(" €", "", .x)),
across(c(...), ~ sub(",", ".", .x)),
across(c(...), ~ as.numeric(.x))
)
}
As with virtually any R function, you must still assign the results to sp_merrel_df_clean2
, but this solution will help you do so very cleanly
sp_merrel_df_clean2 <- sp_merrel_df_raw %>%
currency_to_numeric(price_new)
with the following results for sp_merrel_df_clean2
:
product_name price_new price_old date
1 Merrell Riverbed 3 59.99 69,99 € 210720
2 Sapatilhas Montanha Merrell 149.99 <NA> 210720
3 Merrell Moab Adventure 99.99 <NA> 210720
4 Merrell Moab 2 Vent 79.99 99,99 € 210720
In fact, you can simultaneously target as many data-variables (like price_new
and price_old
) as you want
sp_merrel_df_clean2 <- sp_merrel_df_raw %>%
currency_to_numeric(price_new, price_old)
and covert all your currency columns in one fell swoop!
product_name price_new price_old date
1 Merrell Riverbed 3 59.99 69.99 210720
2 Sapatilhas Montanha Merrell 149.99 NA 210720
3 Merrell Moab Adventure 99.99 NA 210720
4 Merrell Moab 2 Vent 79.99 99.99 210720
Upvotes: 1
Reputation: 21264
You can use {{}}
('curly-curly') syntax to pass in unquoted variables.
library(tidyverse)
currency_to_numeric <- function(df, col) {
df %>%
mutate(
price_new = sub(" €", "", {{col}}),
price_new = sub(",", ".", {{col}}),
price_new = as.numeric({{col}})
)
}
currency_to_numeric(raw, price_new)
# A tibble: 4 x 4
product_name price_new price_old date
<chr> <dbl> <chr> <date>
1 a 0.54 0,53 € 2021-07-19
2 b 0.41 1,3 € 2021-07-18
3 c 0.08 1,66 € 2021-07-17
4 d 0.35 0,25 € 2021-07-16
Example data
product_name <- letters[1:4]
price_new <- paste(round(abs(rnorm(4)), 2), "€") %>% str_replace(., "\\.", ",")
price_old <- paste(round(abs(rnorm(4)), 2), "€") %>% str_replace(., "\\.", ",")
date <- lubridate::today() - 1:4
raw <- tibble(product_name, price_new, price_old, date)
If you want to update both price_
columns in the same command, you could rewrite currency_to_numeric
to operate on a column, instead of an entire data frame. Like this:
currency_to_numeric2 <- function(x) {
updated = sub(" €", "", x)
updated = sub(",", ".", updated)
updated = as.numeric(updated)
}
raw %>% mutate(across(starts_with("price"), currency_to_numeric2))
# A tibble: 4 x 4
product_name price_new price_old date
<chr> <dbl> <dbl> <date>
1 a 0.54 0.53 2021-07-19
2 b 0.41 1.3 2021-07-18
3 c 0.08 1.66 2021-07-17
4 d 0.35 0.25 2021-07-16
Upvotes: 0