Reputation: 6132
I've been struggling to understand tidyeval and the use of quo, quos, sym, !!, !!! and the like. I made some attempts, but couldn't generalize my code so that it accepts a vector of columns and applies text processing to those columns of a dataframe. My dataframe looks like this:
ocupation     tasks              id
Sink Cleaner  Cleaning the sink  1
Lion petter   Pet the lions      2
And my code looks like this:
library(dplyr)
library(stringr)
library(glue)

# Build one regex that matches any English stopword as a whole word
stopwords_regex <- paste(tm::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex <- glue('\\b{stopwords_regex}\\b')

df <- df %>%
  mutate(
    ocupation_proc = ocupation %>%
      tolower() %>%
      stringi::stri_trans_general("Latin-ASCII") %>%
      str_remove_all(stopwords_regex) %>%
      str_remove_all("[[:punct:]]") %>%
      str_squish(),
    tasks_proc = tasks %>%
      tolower() %>%
      stringi::stri_trans_general("Latin-ASCII") %>%
      str_remove_all(stopwords_regex) %>%
      str_remove_all("[[:punct:]]") %>%
      str_squish()
  )
Which produces something like this:
ocupation     tasks              id  ocupation_proc  tasks_proc
Sink Cleaner  Cleaning the sink  1   sink cleaner    cleaning sink
Lion petter   Pet the lions      2   lion petter     pet lions
I'd like to turn this into a function process_text_columns(df, columns_list, new_col_names), where in this case df=df, columns_list=c('ocupation', 'tasks') and new_col_names=c('ocupation_proc', 'tasks_proc') (new_col_names might not even be necessary if I can do something like glue('{colname}_proc') to name the new columns). From what I've gathered, I'd need to use across, sym, quos and maybe !!! to generalize the function, but everything I've tried has failed. Do you have any ideas?
Thanks
Upvotes: 0
Views: 113
Reputation: 66415
Does this work for you as expected? The "curly curly" operator ({{ }}), introduced in rlang 0.4 in June 2019, simplifies this by combining "quote-and-unquote into a single interpolation step."
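# Cleaning pipeline for one character column.
# Relies on stopwords_regex as defined in the question.
clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>%
    stringi::stri_trans_general("Latin-ASCII") %>%
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>%
    str_squish()
}

# columns_list: unquoted column selection, e.g. c(ocupation, tasks)
# new_col_names: named character vector of the form c(new = "old")
my_great_function <- function(df, columns_list, new_col_names) {
  mutate(df, across({{ columns_list }}, ~ clean_steps(.x))) %>%
    rename(!!new_col_names)
}

my_great_function(df,
                  c(ocupation, tasks),
                  c(ocu = "ocupation", tas = "tasks"))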
Output
  ocu           tas            id
1 sink cleaner  cleaning sink   1
2 lion petter   pet lions       2
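To make the "quote-and-unquote" remark concrete, here is a minimal sketch comparing the older enquo() plus !! pattern with {{ }}. The helper names upper_old and upper_new and the toy data are made up for illustration; they are not part of the question's code.

```r
library(dplyr)

# Hypothetical helper, older style: quote the argument with enquo(),
# then unquote it with !! inside the dplyr verb.
upper_old <- function(df, cols) {
  cols <- enquo(cols)
  mutate(df, across(!!cols, toupper))
}

# Hypothetical helper, newer style: {{ }} does the quoting and
# unquoting in a single step.
upper_new <- function(df, cols) {
  mutate(df, across({{ cols }}, toupper))
}

# Toy data (an assumption, only for the comparison)
toy <- tibble(a = c("x", "y"), b = c("p", "q"), id = 1:2)

upper_old(toy, c(a, b))
upper_new(toy, c(a, b))
```

Both calls should upper-case columns a and b and leave id alone, which is why the function above only needs {{columns_list}} and no explicit quos or sym.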
EDIT: To keep the unprocessed columns and add the processed ones under new names, the easiest approach is the .names argument of across:
# new_col_names is no longer needed: .names builds the new names from the old ones
my_great_function <- function(df, columns_list) {
  mutate(df, across({{ columns_list }}, ~ clean_steps(.x), .names = "{.col}_proc"))
}

my_great_function(df, c(ocupation, tasks))
  ocupation     tasks              id  ocupation_proc  tasks_proc
1 Sink Cleaner  Cleaning the sink   1  sink cleaner    cleaning sink
2 Lion petter   Pet the lions       2  lion petter     pet lions
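The question passes the columns as strings, c('ocupation', 'tasks'), rather than as bare names. A possible variant for that case, as a sketch only: wrap the character vector in all_of() instead of using {{ }}. The suffix argument is an addition for illustration, and the three-word stopword list is a stand-in for tm::stopwords('en') so the example runs without the tm package.

```r
library(dplyr)
library(stringr)

# Assumption: a tiny hand-written stopword list replaces tm::stopwords('en')
stopwords_regex <- paste0("\\b(", paste(c("the", "a", "of"), collapse = "|"), ")\\b")

clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>%
    stringi::stri_trans_general("Latin-ASCII") %>%
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>%
    str_squish()
}

# columns_list is a character vector of column names;
# all_of() tells tidyselect to treat it as such.
process_text_columns <- function(df, columns_list, suffix = "_proc") {
  mutate(df, across(all_of(columns_list), clean_steps,
                    .names = paste0("{.col}", suffix)))
}

df <- tibble(
  ocupation = c("Sink Cleaner", "Lion petter"),
  tasks     = c("Cleaning the sink", "Pet the lions"),
  id        = 1:2
)

process_text_columns(df, c("ocupation", "tasks"))
```

all_of() raises an error if one of the named columns is missing from df, which is usually what you want here; any_of() would skip missing columns silently.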
Upvotes: 3