elliot
elliot

Reputation: 1944

dplyr-friendly user function to use with join functions

The work I do involves merging several databases from different sources on a key look-up variable. The variable is a string variable and often spelled many different ways depending on the source of the data (i.e., 'New York City', 'City of New York').

I've written a simple function to clean look-up variables in each dataset and am using it in this manner:

clean.names <- function(x){
  x %>% 
    str_remove_all('[:punct:]') %>% 
    str_to_lower() %>% 
    str_squish() %>% 
    str_trim()
}

#df_1 and df_2 are dataframes, with variables key that I use to merge. 
df_1 %>% 
  mutate(clean_name = clean.names(key1)) %>% 
  left_join(df_2 %>% 
              mutate(clean_name = clean.names(key2)), 
            by = 'clean_name')

This function works quite well in doing what it's intended for. However, the code is a bit verbose. My question is: How can I create a function that works with dplyr (i.e., no quotes, etc.) that produces the same effect as what's above? I'd like it to be in true dplyr form and work as a wrapper in the join functions. I have tried my hand at this with no success, see here:

clean.names <- function(x =df, y = merge.vary){ # function adds
  x$merge.vary <- y %>%                         # a new variable
    str_remove_all('[:punct:]') %>%             # to existing dataframe
    str_to_lower() %>% 
    str_squish() %>% 
    str_trim()
}


clean.names(df_1, key1) %>%         # then use the function as a wrapper
  left_join(clean.names(df_2, key2))# for the dplyr join functions 

Is there a way to do this? What I'd like to have is a dplyr-esque acting function that looks like: function(dataframe, variable). Thanks.

Upvotes: 4

Views: 488

Answers (1)

Artem Sokolov
Artem Sokolov

Reputation: 13691

You can use rlang::ensym() to capture the variable name and pass it directly to mutate_at:

library(tidyverse)
clean.names2 <- function( .df, .var ) {
  f <- compose( partial(str_remove_all, pattern='[:punct:]'), 
                 str_to_lower, str_squish, str_trim )
  .df %>% mutate_at( vars(!!ensym(.var)), f )
}

The function works with variable names provided with and without quotes:

X <- tibble( Cities = c("  New York City, NY", "Denver, CO;;") )
clean.names2( X, Cities )
# # A tibble: 2 x 1
#   Cities          
#   <chr>           
# 1 new york city ny
# 2 denver co       

clean.names2( X, "Cities" )  ## equivalent

Brief explanation:

The first line creates a composite function by stringing together str_remove_all, str_to_lower, str_squish, str_trim and using partial() to assign a prespecified value to the pattern parameter of str_remove_all. The resulting function f is identical to your original clean.names. (I just tried to make clean.names2 self-contained.)

The second line applies the new composite function f (or equivalently, your original clean.names) to a single column in .df using mutate_at. The column is specified using two tidyverse mechanisms. The first is vars(), which allows users to specify column names with and without quotes. For example, the following two lines are equivalent:

mydf %>% mutate_at( vars("mycolumn"), myfunction )  # vars can be dropped here
mydf %>% mutate_at( vars(mycolumn), myfunction )

The second mechanism belongs to the class of functions that deal with quasiquoatation, allowing programmers to work directly with unevaluated expressions, including those provided by the function users. In particular, we use ensym to capture a symbol provided to the function by the calling environment and pass that symbol to vars. The !! is important, because it tells vars to go ahead and evaluate the expression ensym(.var) and use the result as the column name. Without !!, vars would be trying to find a column with the name "ensym(.var)" instead.

Upvotes: 3

Related Questions