Starbucks

Reputation: 1558

Various results with distinct() in a custom function

I want to create a function in R that will create a numerical column based on a character/categorical column. To do this, I need to get the distinct values in the categorical column. I can do this outside a function without trouble, but I would like to make a reusable function for it. The issue I've run into is that the same distinct() call that works outside the function doesn't behave the same way inside the function. I've created a demo below:

# test of call to db to numericize
DF <- data.frame("a" = c("a","b","c","a","b","c"),
                 "b" = paste(0:5, ".1", sep = ""),
                 "c" = letters[1:6],
                 stringsAsFactors = FALSE)

catnum <- function(db, inputcolname) {
  x <- distinct(db,inputcolname);
  print(x);
  return(x);
}

y <- distinct(DF,a)
y
catnum(DF,'a')

While y gives the correct one-column result (a single column containing a, b, c), x within the function is the entire data frame. I have tried with and without the quotes, as in catnum(DF, a), but the results are the same.

Could someone tell me what is happening or suggest some code that would work?

Upvotes: 1

Views: 317

Answers (3)

Ista

Reputation: 10437

Your inputs are not the same, so you get different results. If you give distinct the same arguments you give catnum, you will get the same result:

isTRUE(all.equal(distinct(DF, a),
                 catnum(DF, "a")))
## [1] FALSE
isTRUE(all.equal(distinct(DF, "a"),
                 catnum(DF, "a")))
## [1] TRUE

Unfortunately, this does not work:

catnum(DF, a)
##   a   b c
## 1 a 0.1 a
## 2 b 1.1 b
## 3 c 2.1 c
## 4 a 3.1 d
## 5 b 4.1 e
## 6 c 5.1 f

The reason, as explained in

vignette("programming")

is that you must jump through several annoying hoops if you want to write functions that use functions from dplyr. The solution (as you will learn in the vignette) is as follows:

catnum <- function(db, inputcolname) {
  inputcolname <- enquo(inputcolname)  
  distinct(db, !!inputcolname)
}

catnum(DF, a)
##   a
## 1 a
## 2 b
## 3 c
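
As a hedged aside, newer versions of rlang (0.4.0 and later) provide the {{ }} "embrace" operator, which combines the enquo() and !! steps into one. The sketch below assumes a reasonably recent dplyr and rlang are installed; the function name catnum_embrace is made up for illustration and is not part of the original code.

```r
library(dplyr)

# Same demo data as in the question.
DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 b = paste0(0:5, ".1"),
                 c = letters[1:6],
                 stringsAsFactors = FALSE)

# catnum_embrace is a hypothetical name used only for this sketch.
# {{ inputcolname }} captures the bare column name supplied by the caller
# and injects it into distinct(), like enquo() followed by !!.
catnum_embrace <- function(db, inputcolname) {
  distinct(db, {{ inputcolname }})
}

res_embrace <- catnum_embrace(DF, a)
res_embrace
```

This version is called with a bare column name, as in catnum_embrace(DF, a), not with a quoted string.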

Or you could conclude that this is all too confusing and do something like

catnum <- function(db, inputcolname) {
  unique(db[, inputcolname, drop = FALSE])
}

catnum(DF, "a")
##   a
## 1 a
## 2 b
## 3 c

instead.
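
Since the stated goal in the question is a numeric column derived from a categorical one, the base R approach can be taken one step further. The following is only a sketch under assumptions: the helper name add_codes and the "_num" suffix are invented here, and the codes are assigned in order of first appearance, which may or may not be the numbering the asker wants.

```r
# Same demo data as in the question.
DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 b = paste0(0:5, ".1"),
                 c = letters[1:6],
                 stringsAsFactors = FALSE)

# add_codes is a hypothetical helper, not from the original post.
# It looks up each value's position among the distinct values of the
# column and stores that position as an integer code in a new column.
add_codes <- function(db, inputcolname,
                      newcol = paste0(inputcolname, "_num")) {
  values <- db[[inputcolname]]
  db[[newcol]] <- match(values, unique(values))
  db
}

res_codes <- add_codes(DF, "a")
res_codes
```

Because it uses only base R and takes the column name as a string, this avoids non-standard evaluation entirely.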

Upvotes: 1

MKR

Reputation: 20085

One solution is to use the distinct_ function inside your function. distinct expects a bare column name, and it does not work with a column name stored in a variable.

For example, distinct(DF, "a") will not select column a. The intended syntax is distinct(DF, a); notice the missing quotes. When distinct is called from inside the function, the column is supplied through a variable (i.e. inputcolname), and distinct does not treat that variable's value as a column name, hence the unexpected result. distinct_, on the other hand, accepts a column name held in a variable.

library(dplyr)
catnum <- function(db, inputcolname) {
  x <- distinct_(db,inputcolname);
  #print(x);
  return(x);
}
# With the modified function, the results are as expected.
catnum(DF,'a')
# a
# 1 a
# 2 b
# 3 c
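
Note that the underscore verbs such as distinct_ have been deprecated since dplyr 0.7.0, so the code above may warn or stop working in current releases. A possible replacement that still takes the column name as a string is sketched below. It assumes rlang is available (it is installed as a dplyr dependency), and the name catnum_str is made up for this example.

```r
library(dplyr)

# Same demo data as in the question.
DF <- data.frame(a = c("a", "b", "c", "a", "b", "c"),
                 b = paste0(0:5, ".1"),
                 c = letters[1:6],
                 stringsAsFactors = FALSE)

# catnum_str is a hypothetical name used only for this sketch.
# rlang::sym() turns the string into a symbol, and !! injects that
# symbol into distinct() as if the column name had been typed bare.
catnum_str <- function(db, inputcolname) {
  distinct(db, !!rlang::sym(inputcolname))
}

res_str <- catnum_str(DF, "a")
res_str
```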

Upvotes: 2

JeanVuda

Reputation: 1778

I'm not sure what you are trying to do or where the distinct function is coming from. Are you looking for this?

catnum<-function(DF,var){
  length(unique(DF[[var]]))
}
catnum(DF,'a')

Upvotes: 1
