Alex

Reputation: 15708

Why does dplyr::distinct behave like this for grouped data frames?

My question involves the distinct function from dplyr.

First, set up the data:

library(dplyr)

set.seed(0)

df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

Consider the following two uses of distinct.

df %>%
    group_by(x) %>%
    distinct()

df %>%
    group_by(x) %>%
    distinct(y)

The first produces a different result from the second. As far as I can tell, the first set of operations finds "all distinct values of x, returning the first value of y for each", whereas the second finds "for each value of x, all distinct values of y".

Why should this be so when

df %>%
    distinct(x, y)

df %>% distinct()

produce the same result?

EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110

Upvotes: 6

Views: 324

Answers (1)

Claus Wilke

Reputation: 17790

As far as I can tell, the answer is that distinct considers grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.

Thus:

df %>%
    group_by(x) %>%
    distinct()

This groups by x and then keeps only the rows that are distinct in x (!), i.e. one row per value of x. This seems to be a bug.

However:

df %>%
    group_by(x) %>%
    distinct(y)

This groups by x and then finds the values of y that are distinct within each value of x. It is equivalent to either of these cases:

df %>%
    distinct(x, y)

df %>% distinct()

Both find distinct values in x and y.

The take-home message seems to be: don't combine grouping with distinct. Instead, pass the relevant column names as arguments to distinct.
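As a rough sketch of that advice, reusing the df from the question: the first pipeline names the columns explicitly on an ungrouped data frame, and the second assumes the data frame arrives already grouped and drops the grouping with ungroup() before calling distinct. The names explicit and ungrouped are only illustrative, and the comparison against base R's unique() is a sanity check I added, not something from the linked issue.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
    x = sample(10, 100, replace = TRUE),
    y = sample(10, 100, replace = TRUE)
)

# Name the columns explicitly on an ungrouped data frame.
explicit <- df %>%
    distinct(x, y)

# If the data frame is already grouped, drop the grouping first.
ungrouped <- df %>%
    group_by(x) %>%
    ungroup() %>%
    distinct(x, y)

# Both should hold one row per unique (x, y) pair, matching base R.
nrow(explicit) == nrow(unique(df))
nrow(ungrouped) == nrow(unique(df))
```

Whether the grouped call with no arguments agrees with these depends on the installed dplyr version, so it may be worth running the same row-count comparison on it before relying on it.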

Upvotes: 2
