Reputation: 15708
My question involves the distinct function from dplyr.
First, set up the data:
library(dplyr)
set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)
Consider the following two uses of distinct.
df %>%
  group_by(x) %>%
  distinct()

df %>%
  group_by(x) %>%
  distinct(y)
The first produces a different result from the second. As far as I can tell, the first set of operations finds "all distinct values of x, and returns the first value of y for each", whereas the second finds "for each value of x, all distinct values of y".
Why should this be so when
df %>%
  distinct(x, y)

df %>% distinct()
produce the same result?
EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110
Upvotes: 6
Views: 324
Reputation: 17790
As far as I can tell, the answer is that distinct considers grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.
Thus:
df %>%
group_by(x) %>%
distinct()
This groups by x, then finds rows that are distinct in x alone (!). This seems to be a bug.
However:
df %>%
group_by(x) %>%
distinct(y)
This groups by x, then finds values of y that are distinct within each value of x. This is equivalent to either of these cases:
df %>%
  distinct(x, y)

df %>% distinct()
Both find distinct combinations of x and y.
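One way to check this equivalence is to compare row counts. The following is only a sketch: it assumes dplyr is attached and rebuilds the df from the question, and the names grouped, explicit and base_r are made up for illustration. The grouped call with no arguments is left out because its result appears to depend on the dplyr version.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)

# Distinct (x, y) pairs, computed three ways
grouped  <- df %>% group_by(x) %>% distinct(y)  # grouping column x is included
explicit <- df %>% distinct(x, y)               # both columns named explicitly
base_r   <- unique(df)                          # base R equivalent

# The three row counts should agree
c(grouped = nrow(grouped), explicit = nrow(explicit), base_r = nrow(base_r))
```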
The take-home message seems to be: don't combine grouping with distinct. Just pass the relevant column names as arguments to distinct.
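If the data frame is already grouped by an earlier step, one possible way to follow this advice is to drop the grouping first and then name the columns. This is a minimal sketch, assuming the df from the question; grouped_df and result are hypothetical names.

```r
library(dplyr)

set.seed(0)
df <- data.frame(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)

# Hypothetical grouped input, e.g. left over from an earlier pipeline step
grouped_df <- df %>% group_by(x)

# Remove the grouping, then state explicitly which columns define distinctness
result <- grouped_df %>%
  ungroup() %>%
  distinct(x, y)

# Sanity check against base R: should be TRUE
nrow(result) == nrow(unique(df))
```

Naming the columns keeps the result independent of whatever grouping the data happened to carry.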
Upvotes: 2