cacti5

Reputation: 2106

Difference between Distinct vs Unique

What are the differences between distinct() and unique() in R when using dplyr, for example with regard to speed and accepted input?

For example:

library(dplyr)
data(iris)

# creating data with duplicates
iris_dup <- bind_rows(iris, iris)

d <- distinct(iris_dup)
u <- unique(iris_dup)

all(d == u) # returns TRUE

In this example distinct and unique perform the same function. Are there examples of times you should use one but not the other? Are there any tricks or common uses of one?

Upvotes: 20

Views: 31176

Answers (3)

peacefulzephyr

Reputation: 1

If you use data.table, there is reason to suspect that unique() is faster than distinct() when applied to a data.table object. Following @RobertMyles's approach, I find that unique()'s speed advantage is negligible at n = 1e+06 (0 vs. 0.02 sec) but clear at n = 1e+07 (0.06 vs. 0.62 sec):

library(dplyr)
library(tictoc)
library(glue)
library(data.table)

make_a_df <- function(nrows = NULL){
  df <- data.table(
    alpha = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  tic()
  unique(df, by = 'alpha')
  print(glue('Unique with {nrows}: '))
  toc()
  
  tic()
  distinct(df, alpha, .keep_all = TRUE)
  print(glue('Distinct with {nrows}: '))
  toc()
}

make_a_df(1000000); make_a_df(10000000)
Unique with 1e+06: 
0 sec elapsed
Distinct with 1e+06: 
0.02 sec elapsed
Unique with 1e+07: 
0.06 sec elapsed
Distinct with 1e+07: 
0.62 sec elapsed

Upvotes: 0

RobertMyles

Reputation: 2832

With regard to two of your criteria, speed and input, here's a little function using the tictoc library. It shows that distinct() is notably faster (the input has numeric and character columns):

library(dplyr)
library(tictoc)
library(glue)

make_a_df <- function(nrows = NULL){
  # Each timed block below includes building the tibble, not just de-duplicating it
  tic()
  df <- tibble(
    alpha = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  unique(df)
  print(glue('Unique with {nrows}: '))
  toc()

  tic()
  df <- tibble(
    alpha = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(mean = 0, sd = 1, n = nrows)
  )
  distinct(df)
  print(glue('Distinct with {nrows}: '))
  toc()
}

Result:

> make_a_df(50); make_a_df(500); make_a_df(5000); make_a_df(50000); make_a_df(500000)
Unique with 50: 
0.02 sec elapsed
Distinct with 50: 
0 sec elapsed
Unique with 500: 
0 sec elapsed
Distinct with 500: 
0 sec elapsed
Unique with 5000: 
0.02 sec elapsed
Distinct with 5000: 
0 sec elapsed
Unique with 50000: 
0.09 sec elapsed
Distinct with 50000: 
0.01 sec elapsed
Unique with 5e+05: 
1.77 sec elapsed
Distinct with 5e+05: 
0.34 sec elapsed
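
The timings above include the cost of building the tibble. A minimal sketch, assuming you want to time only the de-duplication step, builds the data once and uses base R's system.time(). The helper name time_dedup is hypothetical and not part of the original answer:

```r
library(dplyr)

# Hypothetical helper: times unique() and distinct() on the same tibble
time_dedup <- function(nrows) {
  # Build the data once, outside the timed region
  df <- tibble(
    alpha = sample(letters, nrows, replace = TRUE),
    numeric = rnorm(n = nrows, mean = 0, sd = 1)
  )
  # Return elapsed wall-clock seconds for each approach
  c(
    unique = system.time(unique(df))[["elapsed"]],
    distinct = system.time(distinct(df))[["elapsed"]]
  )
}

set.seed(42)
timings <- time_dedup(50000)
print(timings)
```

Single runs like this are noisy, so repeating the call a few times (or at several sizes) gives a fairer picture.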

Upvotes: 5

Raj Padmanabhan

Reputation: 540

These functions can largely be used interchangeably, since each offers an equivalent of what the other does. The main differences lie in speed and output format.

distinct() is a function from the dplyr package and can be customized. For example, the following snippet returns only the distinct combinations of a specified set of columns in the data frame:

distinct(iris_dup, Petal.Width, Species)
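
As a further illustration of that customization, here is a small sketch using the `.keep_all` argument of distinct() (the variable names two_cols and all_cols are only for illustration). With `.keep_all = TRUE`, the distinct combinations are still determined by the named columns, but the other columns of the first matching row are kept as well:

```r
library(dplyr)

# Same duplicated data as in the question
iris_dup <- bind_rows(iris, iris)

# Only the two named columns, one row per distinct combination
two_cols <- distinct(iris_dup, Petal.Width, Species)

# Same distinct combinations, but all columns of the first matching row are kept
all_cols <- distinct(iris_dup, Petal.Width, Species, .keep_all = TRUE)

dim(two_cols)
dim(all_cols)
```

Both results have the same number of rows; only the number of columns differs.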

unique() strictly returns the unique rows of a data frame. All the elements in a row must match for two rows to count as duplicates.

Edit: As Imo points out, unique() offers similar functionality. We first create a temporary data frame containing only the columns of interest and then find the unique rows of that. This process may be slower for large data frames.

unique(iris_dup[c("Petal.Width", "Species")])

Both return the same rows, with one small difference in the row numbers: distinct() renumbers the rows sequentially, whereas unique() keeps the row number of the first occurrence of each unique combination.

     Petal.Width    Species
1          0.2     setosa
2          0.4     setosa
3          0.3     setosa
4          0.1     setosa
5          0.5     setosa
6          0.6     setosa
7          1.4 versicolor
8          1.5 versicolor
9          1.3 versicolor
10         1.6 versicolor
11         1.0 versicolor
12         1.1 versicolor
13         1.8 versicolor
14         1.2 versicolor
15         1.7 versicolor
16         2.5  virginica
17         1.9  virginica
18         2.1  virginica
19         1.8  virginica
20         2.2  virginica
21         1.7  virginica
22         2.0  virginica
23         2.4  virginica
24         2.3  virginica
25         1.5  virginica
26         1.6  virginica
27         1.4  virginica
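
The row-number difference can be checked directly. This is a minimal sketch, assuming a plain data.frame input as in the question (the names d and u are just illustrative):

```r
library(dplyr)

# Same duplicated data as in the question
iris_dup <- bind_rows(iris, iris)

d <- distinct(iris_dup, Petal.Width, Species)
u <- unique(iris_dup[c("Petal.Width", "Species")])

# distinct() renumbers the rows sequentially
head(rownames(d))

# unique() keeps the original row number of each first occurrence
head(rownames(u))

# The values themselves are the same, in the same order
all(d$Petal.Width == u$Petal.Width)
all(as.character(d$Species) == as.character(u$Species))
```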

Overall, both functions return the unique rows based on the combined set of columns chosen. However, I am inclined to go with the dplyr documentation, which states that distinct() is faster.

Upvotes: 17
