Reputation: 1944
The work I do involves merging several databases from different sources on a key look-up variable. The variable is a string variable and often spelled many different ways depending on the source of the data (i.e., 'New York City', 'City of New York').
I've written a simple function to clean look-up variables in each dataset and am using it in this manner:
clean.names <- function(x){
x %>%
str_remove_all('[:punct:]') %>%
str_to_lower() %>%
str_squish() %>%
str_trim()
}
#df_1 and df_2 are dataframes, with variables key that I use to merge.
df_1 %>%
mutate(clean_name = clean.names(key1)) %>%
left_join(df_2 %>%
mutate(clean_name = clean.names(key2)),
by = 'clean_name')
This function works quite well in doing what it's intended for. However, the code is a bit verbose. My question is: How can I create a function that works with dplyr (i.e., no quotes, etc.) that produces the same effect as what's above? I'd like it to be in true dplyr form and work as a wrapper in the join functions. I have tried my hand at this with no success, see here:
clean.names <- function(x =df, y = merge.vary){ # function adds
x$merge.vary <- y %>% # a new variable
str_remove_all('[:punct:]') %>% # to existing dataframe
str_to_lower() %>%
str_squish() %>%
str_trim()
}
clean.names(df_1, key1) %>% # then use the function as a wrapper
left_join(clean.names(df_2, key2))# for the dplyr join functions
Is there a way to do this? What I'd like to have is a dplyr-esque acting function that looks like: function(dataframe, variable). Thanks.
Upvotes: 4
Views: 488
Reputation: 13691
You can use rlang::ensym()
to capture the variable name and pass it directly to mutate_at
:
library(tidyverse)
clean.names2 <- function( .df, .var ) {
f <- compose( partial(str_remove_all, pattern='[:punct:]'),
str_to_lower, str_squish, str_trim )
.df %>% mutate_at( vars(!!ensym(.var)), f )
}
The function works with variable names provided with and without quotes:
X <- tibble( Cities = c(" New York City, NY", "Denver, CO;;") )
clean.names2( X, Cities )
# # A tibble: 2 x 1
# Cities
# <chr>
# 1 new york city ny
# 2 denver co
clean.names2( X, "Cities" ) ## equivalent
Brief explanation:
The first line creates a composite function by stringing together str_remove_all
, str_to_lower
, str_squish
, str_trim
and using partial()
to assign a prespecified value to the pattern
parameter of str_remove_all
. The resulting function f
is identical to your original clean.names
. (I just tried to make clean.names2
self-contained.)
The second line applies the new composite function f
(or equivalently, your original clean.names
) to a single column in .df
using mutate_at
. The column is specified using two tidyverse mechanisms. The first is vars()
, which allows users to specify column names with and without quotes. For example, the following two lines are equivalent:
mydf %>% mutate_at( vars("mycolumn"), myfunction ) # vars can be dropped here
mydf %>% mutate_at( vars(mycolumn), myfunction )
The second mechanism belongs to the class of functions that deal with quasiquoatation, allowing programmers to work directly with unevaluated expressions, including those provided by the function users. In particular, we use ensym
to capture a symbol provided to the function by the calling environment and pass that symbol to vars
. The !!
is important, because it tells vars
to go ahead and evaluate the expression ensym(.var)
and use the result as the column name. Without !!
, vars
would be trying to find a column with the name "ensym(.var)" instead.
Upvotes: 3