ulfelder

Reputation: 5335

Unexpected behavior with n_distinct inside pipe

I am trying to use the n_distinct function from dplyr inside a pipe in a function and am finding it to be sensitive to my choice of syntax in a way I didn't expect. Here's a toy example.

# preliminaries
library(tidyverse)
set.seed(123)
X <- data.frame(a1 = rnorm(10), a2 = rnorm(10), b = rep(LETTERS[1:5], times = 2), stringsAsFactors = FALSE)
print(X)
            a1         a2 b
1  -0.56047565  1.2240818 A
2  -0.23017749  0.3598138 B
3   1.55870831  0.4007715 C
4   0.07050839  0.1106827 D
5   0.12928774 -0.5558411 E
6   1.71506499  1.7869131 A
7   0.46091621  0.4978505 B
8  -1.26506123 -1.9666172 C
9  -0.68685285  0.7013559 D
10 -0.44566197 -0.4727914 E

Okay, now let's say I want to iterate a function over the names of selected columns in that data frame (humor me). Here, I'm going to use the values in the selected column to filter the initial data set, count the number of unique ids (values of b) that remain, and return the result as a one-row tibble that I then bind into a new tibble. When I create the filtered tibble inside the function and then apply n_distinct to a column of that tibble as a separate step, I get the expected results from n_distinct, 5 and 4.

bind_rows(map(str_subset(colnames(X), "a"), function(i) {

  subdf <- filter(X, !!sym(i) > 0)

  value <- n_distinct(subdf$b)

  tibble(y = i, n_uniq = value)

}))

# A tibble: 2 x 2
  y     n_uniq
  <chr>  <int>
1 a1         5
2 a2         4

If I put n_distinct inside a pipe and use . to refer to the filtered tibble, however, the code executes but I get a different and incorrect result.

bind_rows(map(str_subset(colnames(X), "a"), function(i) {

  value <- filter(X, !!sym(i) > 0) %>% n_distinct(.$b)

  tibble(y = i, n_uniq = value)

}))

# A tibble: 2 x 2
  y     n_uniq
  <chr>  <int>
1 a1         5
2 a2         7

What's up with that? Am I misunderstanding the use of . inside a pipe? Is something funky with n_distinct?

Upvotes: 4

Views: 1051

Answers (4)

user63230

Reputation: 4708

I think it's worth writing a small helper function for calling n_distinct at the end of a pipe, which saves having to remember which syntax to use:

n_distinct_end_of_pipe <- function(data, variable) {
  # Keep only the requested column, then count its distinct rows
  data %>%
    select(!!rlang::enquo(variable)) %>%
    n_distinct()
}

iris %>% 
  n_distinct_end_of_pipe(Species)
# 3

Upvotes: 0

T_R

Reputation: 33

Agreed. If you're just looking for the number of distinct values, as in Adam's last example, you might be better off with length(unique(iris$Species)), depending on your goals.

Upvotes: 0

user10917479

Reputation:

Here is a minimal example of what I think you are seeing.

iris %>%
  n_distinct(.$Species)
# 149

n_distinct(iris$Species)
# 3

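The first option is actually evaluated as follows: the pipe inserts iris as the first argument, so n_distinct counts distinct rows of the whole data frame, and the .$Species is redundant.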

n_distinct(iris, iris$Species)
# 149
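To make the equivalence above concrete, here is a small check, assuming a reasonably recent dplyr in which n_distinct accepts a data frame together with a vector. It compares the piped call with the explicit two-argument call and with the number of distinct rows in iris. The variable names `piped`, `explicit`, and `distinct_rows` are only illustrative.

```r
library(dplyr)

# The pipe inserts iris as the first argument; .$Species becomes a second one.
piped <- iris %>% n_distinct(.$Species)

# The same call written out by hand.
explicit <- n_distinct(iris, iris$Species)

# Species is already a column of iris, so it adds no new information:
# the count is simply the number of distinct rows.
distinct_rows <- nrow(distinct(iris))

identical(piped, explicit)
piped == distinct_rows
```

Both comparisons should come back TRUE, which is why the piped version reports the 149 shown above rather than 3.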

To do this in a pipe without resorting to unusual syntax, I think you need something like this.

iris %>%
  distinct(Species) %>% 
  count()
# 3
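Note that distinct() followed by count() returns a one-row data frame with a column n rather than a bare number. If a plain integer is wanted, one possible alternative (a sketch, not the only option) is to extract the column with pull() first, or to compute the count inside summarise(). The names `n_species` and `n_species_df` are illustrative.

```r
library(dplyr)

# Extract the column as a vector, then count its distinct values.
n_species <- iris %>%
  pull(Species) %>%
  n_distinct()

# Or keep the result in a data frame, with a named column.
n_species_df <- iris %>%
  summarise(n_uniq = n_distinct(Species))

n_species
n_species_df
```

In both forms n_distinct only ever sees the Species column, so the data frame is never passed as an extra argument.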

Upvotes: 3

IceCreamToucan

Reputation: 28705

n_distinct accepts multiple arguments, and here you're actually passing both the tibble and the b column as arguments, since the left-hand side of the pipe is passed as the first argument by default. Using . inside a nested call such as .$b does not switch that behavior off. Here are some other ways of getting the expected output:

filter(X, !!sym(i) > 0) %>% 
  {n_distinct(.$b)}

filter(X, !!sym(i) > 0) %>% 
  with(n_distinct(b))

library(magrittr)

filter(X, !!sym(i) > 0) %$% 
  n_distinct(b)
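The argument-placement rule behind these workarounds can be seen with a toy function that just counts how many arguments it receives. This is a minimal sketch; `count_args` is a hypothetical helper written for this illustration, not part of magrittr or dplyr.

```r
library(magrittr)

# Hypothetical helper: returns the number of arguments it was called with.
count_args <- function(...) length(list(...))

# Dot used directly as an argument: it replaces the default first argument.
direct <- 1:3 %>% count_args(.)

# Dot used only inside a nested call: the left-hand side is STILL inserted
# as the first argument, so the function receives two arguments.
nested <- 1:3 %>% count_args(length(.))

# Wrapping the right-hand side in braces turns off first-argument insertion.
braced <- 1:3 %>% { count_args(length(.)) }

c(direct = direct, nested = nested, braced = braced)
```

The `nested` case is the one that matches n_distinct(.$b) in the question, and the `braced` case matches the {n_distinct(.$b)} workaround.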

Also, not directly related to your question: purrr has a convenience function, map_dfr, that maps over the input and row-binds the results in one step, so the separate bind_rows call is not needed.

map_dfr(str_subset(colnames(X), "a"), function(i) {

  value <- filter(X, !!sym(i) > 0) %>% {n_distinct(.$b)}

  tibble(y = i, n_uniq = value)

})
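As far as I know, map_dfr has been marked as superseded in purrr 1.0.0 and later, with map() followed by list_rbind() as the suggested replacement. Here is a self-contained sketch of the same computation in that style, assuming purrr 1.0.0 or newer is installed. It rebuilds X from the question and uses pull() so that n_distinct only sees the b column. The name `result` is illustrative.

```r
library(tidyverse)

# Same toy data as in the question.
set.seed(123)
X <- data.frame(a1 = rnorm(10), a2 = rnorm(10),
                b = rep(LETTERS[1:5], times = 2),
                stringsAsFactors = FALSE)

result <- str_subset(colnames(X), "a") %>%
  map(function(i) {
    # Filter on the selected column, then count distinct values of b only.
    value <- filter(X, !!sym(i) > 0) %>%
      pull(b) %>%
      n_distinct()
    tibble(y = i, n_uniq = value)
  }) %>%
  list_rbind()  # row-bind the list of one-row tibbles (purrr >= 1.0.0)

result
```

With the seed above this should reproduce the 5 and 4 from the first version in the question.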

Upvotes: 6
