Turning dplyr code into a function that accepts columns as arguments

Question

I've been struggling to understand tidyeval and the use of quo, quos, sym, !!, !!!, and the like. I made some attempts but couldn't generalize my code so that it accepts a vector of columns and applies text processing to those columns of a data frame. My data frame looks like this:

ocupation      tasks               id
Sink Cleaner   Cleaning the sink   1
Lion petter    Pet the lions       2

And my code looks like this:

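library(dplyr)
library(stringr)
library(glue)

# Wrap every English stopword in word boundaries. In an R string the regex \b
# must be written '\\b'; a bare '\b' is a backspace character.
stopwords_regex = paste(tm::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = glue('\\b{stopwords_regex}\\b')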


df = df %>% mutate(ocupation_proc = ocupation %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>% 
                     str_remove_all("[[:punct:]]") %>%  
                     str_squish(),
                   tasks_proc = tasks %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>%
                     str_remove_all("[[:punct:]]") %>% 
                     str_squish())

Which produces something like this:

ocupation      tasks               id   ocupation_proc   tasks_proc
Sink Cleaner   Cleaning the sink   1    sink cleaner     cleaning sink
Lion petter    Pet the lions       2    lion petter      pet lions

I'd like to turn this into a function process_text_columns(df, columns_list, new_col_names), where in this case df = df, columns_list = c('ocupation', 'tasks'), and new_col_names = c('ocupation_proc', 'tasks_proc'). (new_col_names might not even be necessary if I can do something like glue('{colname}_proc') to name the new columns.) From what I've gathered, I'd need to use across, sym, quos, and maybe !!! to generalize the function, but everything I've tried has failed. Do you have any ideas?

Thanks

Jon Spring · Accepted Answer

Does this work for you as expected? The "curly curly" operator ({{ }}), introduced in rlang 0.4 in June 2019, collapses the quote-and-unquote pattern into a single interpolation step.

clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>% 
    stringi::stri_trans_general("Latin-ASCII") %>% 
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>% 
    str_squish()
}

my_great_function <- function(df, columns_list, new_col_names) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x))) %>%
    rename( !!new_col_names )
}

my_great_function(df, 
                  c(ocupation, tasks), 
                  c(ocu = "ocupation", tas = "tasks"))

Output

           ocu           tas id
1 sink cleaner cleaning sink  1
2  lion petter     pet lions  2
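
In case it helps to see what curly-curly is shorthand for, here is a minimal sketch comparing it with the older enquo() plus !! pattern. The helper names upper_embrace and upper_enquo and the toy data frame are made up for illustration, and toupper stands in for the real cleaning steps:

```r
library(dplyr)

# Hypothetical helper: embrace the argument so across() sees the
# column selection the caller typed.
upper_embrace <- function(df, cols) {
  mutate(df, across({{ cols }}, toupper))
}

# Hypothetical helper: the older two-step pattern, quoting with
# enquo() and then unquoting with !!.
upper_enquo <- function(df, cols) {
  cols <- enquo(cols)
  mutate(df, across(!!cols, toupper))
}

# Toy data, assumed only for this example
toy <- data.frame(a = c("x", "y"), b = c("p", "q"), id = 1:2)

# Both versions should select and transform the same columns
identical(upper_embrace(toy, c(a, b)), upper_enquo(toy, c(a, b)))
```

Both helpers take bare column names, just like my_great_function above; the embraced version simply saves the separate quoting step.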

EDIT: To keep the unprocessed columns and add the processed ones under new names, the easiest way is to use the .names argument of across:

# new_col_names is no longer needed: .names builds the new column names
my_great_function <- function(df, columns_list) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x), .names = "{.col}_proc"))
}

my_great_function(df, c(ocupation, tasks))


     ocupation             tasks id ocupation_proc    tasks_proc
1 Sink Cleaner Cleaning the sink  1   sink cleaner cleaning sink
2  Lion petter     Pet the lions  2    lion petter     pet lions
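
Since the question passes the columns as a character vector, c('ocupation', 'tasks'), rather than bare names, here is a rough sketch of that variant using tidyselect's all_of(). The suffix argument is my own assumption, not something from the question, and the cleaning steps are trimmed down (no stopword removal) so the example runs without tm:

```r
library(dplyr)
library(stringr)

# Sketch only: `suffix` is a hypothetical argument, and the cleaning
# pipeline is simplified (lower-case, drop punctuation, squeeze spaces).
process_text_columns <- function(df, columns_list, suffix = "_proc") {
  mutate(df, across(all_of(columns_list),
                    ~ .x %>%
                      tolower() %>%
                      str_remove_all("[[:punct:]]") %>%
                      str_squish(),
                    # Builds a glue spec such as "{.col}_proc"
                    .names = paste0("{.col}", suffix)))
}

# Same shape as the data frame in the question
df <- data.frame(
  ocupation = c("Sink Cleaner", "Lion petter"),
  tasks = c("Cleaning the sink", "Pet the lions"),
  id = 1:2
)

process_text_columns(df, c("ocupation", "tasks"))
```

all_of() errors if any of the named columns is missing, which is usually what you want inside a function; swapping in clean_steps for the inline pipeline should give the stopword-free output shown above.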
