stackinator

Reputation: 5819

R: n_distinct() count is not consistent across columns

library(dplyr)

distinct(mtcars, mpg) displays the unique occurrences of mpg classes in mtcars.

n_distinct(mtcars, mpg) counts them and displays the correct count 32.

distinct(mtcars, cyl) displays the unique occurrences of cylinder classes in mtcars.

n_distinct(mtcars, cyl) yields an error. Why doesn't it work like the mpg example above? The error below looks wrong to me: cyl is definitely a column in the mtcars data frame.

Error in n_distinct_multi(list(...), na.rm) : object 'cyl' not found

Upvotes: 2

Views: 1337

Answers (2)

lefft

Reputation: 2105

The dplyr::n_distinct() function is not a table verb like mutate(), filter(), etc. Its ... argument expects "vectors of values" (per the official documentation), not a data frame followed by column names.

So when you say dplyr::n_distinct(mtcars, mpg), what is really happening is that the unique values of the first argument mtcars are being counted.

Since mtcars has 32 distinct rows, the value is 32. In the final example you provide, cyl is not recognized because there is no object called cyl in your environment. The reason mpg is recognized is that it refers to the dataset ggplot2::mpg (visible when ggplot2 is attached), not to the column of mtcars with the same name!

To see what I mean, run the following:

dplyr::n_distinct(mtcars)                # 32 
dplyr::n_distinct(ggplot2::mpg)          # 225 
dplyr::n_distinct(mtcars, mpg)           # 32 
dplyr::n_distinct(mtcars, ggplot2::mpg)  # 32 

If you want to count the number of unique values in mtcars$cyl and mtcars$mpg, then just use:

dplyr::n_distinct(mtcars$cyl) # 3 
dplyr::n_distinct(mtcars$mpg) # 25 
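
If you would rather stay with bare column names, n_distinct() can also be called inside a table verb such as summarise(), where cyl and mpg are looked up as columns of the data frame. A minimal sketch (the names counts, n_cyl and n_mpg are only illustrative):

```r
library(dplyr)

# Inside summarise(), cyl and mpg resolve to columns of mtcars,
# so each n_distinct() call receives a plain vector of values.
counts <- summarise(
  mtcars,
  n_cyl = n_distinct(cyl),
  n_mpg = n_distinct(mpg)
)

print(counts)
```

This should give the same counts as the mtcars$cyl and mtcars$mpg calls above.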

A tricky one!

Upvotes: 9

Eugene Brown

Reputation: 4362

Your call to n_distinct(mtcars, mpg) is not returning the correct value, which would be 25. Instead, this line is giving you the number of unique rows in the whole table mtcars, which is 32 according to the output of distinct(mtcars).

What you want to call is n_distinct(mtcars$mpg), which returns 25. Similarly, for the cyl column you would call n_distinct(mtcars$cyl) or the equivalent n_distinct(mtcars[["cyl"]]).

> distinct(mtcars, cyl)
  cyl
1   6
2   4
3   8
> n_distinct(mtcars$cyl)
[1] 3
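
As a quick cross-check, counting the rows that distinct() returns for a single column should agree with n_distinct() on that column's vector. A small sketch, with rows_cyl and vals_cyl as illustrative names:

```r
library(dplyr)

# distinct() is a table verb: it takes the data frame plus a bare column name
rows_cyl <- nrow(distinct(mtcars, cyl))

# n_distinct() takes the vector itself
vals_cyl <- n_distinct(mtcars$cyl)

# TRUE if both approaches count the same number of cylinder classes
print(rows_cyl == vals_cyl)
```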

Upvotes: 3
