wahalulu
wahalulu

Reputation: 1467

R equivalent of SELECT DISTINCT on two or more fields/variables

Say I have a dataframe df with two or more columns, is there an easy way to use unique() or other R function to create a subset of unique combinations of two or more columns?

I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.

It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset:

uniques <- as.data.frame(ftable(df$var1, df$var2))

Upvotes: 43

Views: 112102

Answers (4)

Marek
Marek

Reputation: 50763

unique works on data.frame so unique(df[c("var1","var2")]) should be what you want.

Another option is distinct from dplyr package:

df %>% distinct(var1, var2) # or distinct(df, var1, var2)

Note:

For older versions of dplyr (< 0.5.0, 2016-06-24) distinct required additional step

df %>% select(var1, var2) %>% distinct

(or oldish way distinct(select(df, var1, var2))).

Upvotes: 65

Zaki
Zaki

Reputation: 141

In addition to answers above, the data.table version:

setDT(df)

unique_dt = unique(df, by = c('var1', 'var2'))

Upvotes: 2

tjebo
tjebo

Reputation: 23797

@Marek's answer is obviously correct, but may be outdated. The current dplyrversion (0.7.4) allows for an even simpler code:

Simply use:

df %>% distinct(var1, var2)

If you want to keep all columns, add

df %>% distinct(var1, var2, .keep_all = TRUE)

Upvotes: 29

sbaniwal
sbaniwal

Reputation: 337

To KEEP all other variables in df use this:

unique_rows <- !duplicated(df[c("var1","var2")])

unique.df <- df[unique_rows,]

Another less recommended method is using row.names() #(see David's comment below):

unique_rows <- row.names(unique(df[c("var1","var2")]))

unique.df <- df[unique_rows,]

Upvotes: 5

Related Questions