bosbmgatl

Reputation: 958

Subset with unique cases, based on multiple columns

I'd like to subset a dataframe to include only rows that have unique combinations of three columns. My situation is similar to the one presented in this question, but I'd like to preserve the other columns in my data as well. Here's my example:

> df
  v1  v2  v3   v4  v5
1  7   1   A  100  98 
2  7   2   A   98  97
3  8   1   C   NA  80
4  8   1   C   78  75
5  8   1   C   50  62
6  9   3   C   75  75

The requested output would be something like this, where I'm looking for unique cases based on v1, v2, and v3 only:

> df.new
  v1  v2  v3   v4  v5
1  7   1   A  100  98 
2  7   2   A   98  97
3  8   1   C   NA  80
6  9   3   C   75  75

If I could recover the non-unique rows that would be great too:

> df.dupes
  v1  v2  v3   v4  v5
3  8   1   C   NA  80
4  8   1   C   78  75
5  8   1   C   50  62

I saw a related question about how to do this in SQL (here), but I can't work out how to do it in R. I'm sure it's simple, but my experiments with unique() and subset() haven't been fruitful.

Upvotes: 53

Views: 107193

Answers (7)

Gilbert M.

Reputation: 93

If you're working with data.table, getting unique cases with respect to one or more variables is straightforward. The syntax is:

unique(DT, by = c("var1", "var2"))
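Applied to the example in the question, a minimal sketch (assuming df is the data frame shown above) would be:

library(data.table)

DT <- as.data.table(df)
unique(DT, by = c("v1", "v2", "v3"))  # keeps the first row of each combination, all columns preserved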

Upvotes: 3

I know it's a very old question, but I thought the obvious solution using the unique() function should also be presented here:

unique(df[1:3])

or, specifying columns by name:

unique(df[c("v1", "v2", "v3")])

...or using row-and-column indexing:

unique(df[, c("v1", "v2", "v3")])
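For the example data, all three calls give the same result. Note that they return only the three key columns (v4 and v5 are dropped), with row names taken from the first occurrence of each combination:

> unique(df[1:3])
  v1 v2 v3
1  7  1  A
2  7  2  A
3  8  1  C
6  9  3  C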

Upvotes: 2

sbha

Reputation: 10422

Using dplyr you could do:

library(dplyr)

# distinct
df %>% 
  distinct(v1, v2, v3, .keep_all = TRUE)

# non-distinct only
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() > 1)

# exclude any non-distinct
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() == 1)
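Run against the example data, the distinct() call keeps the first row of each v1/v2/v3 combination while preserving the remaining columns (dplyr renumbers the rows):

  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
4  9  3  C  75 75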

Upvotes: 28

nullalleles

Reputation: 11

A non-elegant but functional way is to paste the key columns of each row together and find which combinations are unique (or non-unique), something like:

df.vector <- apply(df[1:3], 1, function(x) paste(x, collapse = "_"))
df.table <- table(df.vector)

(The separator matters: with collapse = "", two different rows such as c("7", "11") and c("71", "1") would both become "711".)

Then get the indexes of the duplicates with something like:

which(df.vector %in% names(which(df.table > 1)))
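To recover the duplicated rows themselves (the df.dupes from the question), index the data frame with those positions; a short sketch, where dupe.idx is just an illustrative name:

dupe.idx <- which(df.vector %in% names(which(df.table > 1)))
df[dupe.idx, ]
#   v1 v2 v3 v4 v5
# 3  8  1  C NA 80
# 4  8  1  C 78 75
# 5  8  1  C 50 62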

Upvotes: 1

monis rahman

Reputation: 29

Note that using plyr's ddply is very slow if you have a lot of data.

You could try something of this sort instead:

df[sort(unique(c(which(duplicated(df[1:3])), which(duplicated(df[1:3], fromLast = TRUE))))), ]

or, stepwise:

from <- which(duplicated(df[1:3]))
to <- which(duplicated(df[1:3], fromLast = TRUE))
df[sort(unique(c(from, to))), ]

This should be faster for the most part; test it out on your data. unique() is also worth a look.

Upvotes: 2

Ken Williams

Reputation: 24005

You can use the duplicated() function to find the unique combinations:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, you can check it in both directions:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62
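And if you want only the rows whose v1/v2/v3 combination occurs exactly once (dropping every duplicated group entirely), negate the same condition:

> df[!(duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE)),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
6  9  3  C  75 75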

Upvotes: 67

flodel

Reputation: 89097

You can use the plyr package:

library(plyr)

ddply(df, c("v1","v2","v3"), head, 1)
#   v1 v2 v3  v4 v5
# 1  7  1  A 100 98
# 2  7  2  A  98 97
# 3  8  1  C  NA 80
# 4  9  3  C  75 75

ddply(df, c("v1","v2","v3"), function(x) if(nrow(x)>1) x else NULL)
#   v1 v2 v3 v4 v5
# 1  8  1  C NA 80
# 2  8  1  C 78 75
# 3  8  1  C 50 62

Upvotes: 5
