Keith Park

Reputation: 617

Select groups based on number of unique / distinct values

I have a data frame like the one below:

sample <- data.frame(ID = 1:9,
                     Group = c('AA','AA','AA','BB','BB','CC','CC','BB','CC'),
                     Value = c(1,1,1,2,2,2,3,2,3))

ID       Group    Value
1        AA       1
2        AA       1
3        AA       1
4        BB       2
5        BB       2
6        CC       2
7        CC       3
8        BB       2
9        CC       3

I want to select groups according to the number of distinct (unique) values within each group: for example, keep only the groups in which all values are the same (one distinct value per group). Group CC has more than one distinct value (2 and 3) and should therefore be removed; the other groups, each with a single distinct value, should be kept. Desired output:

ID       Group    Value
1        AA       1
2        AA       1
3        AA       1
4        BB       2
5        BB       2
8        BB       2

Could you suggest simple, fast R code that solves this problem?

Upvotes: 10

Views: 682

Answers (4)

thelatemail

Reputation: 93938

data.table version:

library(data.table)
sample <- as.data.table(sample)
sample[ , if(uniqueN(Value) == 1) .SD, by = Group]

#   Group ID Value
#1:    AA  1     1
#2:    AA  2     1
#3:    AA  3     1
#4:    BB  4     2
#5:    BB  5     2
#6:    BB  8     2
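
Another common data.table idiom (a sketch, not from the original answer) computes the qualifying row indices with .I and subsets once, which avoids materializing .SD for every group:

library(data.table)
# assuming `sample` is the data.table created above
idx <- sample[, .I[uniqueN(Value) == 1], by = Group]$V1  # row numbers of qualifying groups
sample[idx]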

An alternative using ave, if the data is numeric, is to check whether the variance within each group is 0 (note that var returns NA for a single-row group, so this test needs care when any group has only one row):

sample[with(sample, ave(Value, Group, FUN = var)) == 0, ]

An alternative solution that could be faster on large data is:

setkey(sample, Group, Value)
# data.table >= 1.9.8: unique() defaults to all columns, so restrict it to the key
ans <- sample[unique(sample, by = key(sample))[, .N, by = Group][N == 1, Group]]
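
To make the pipeline easier to follow, here is the same logic unrolled into named steps (a sketch on the small example data):

# assuming `sample` is the data.table created above
setkey(sample, Group, Value)                # sort and index by Group, then Value
u <- unique(sample, by = key(sample))       # one row per (Group, Value) combination
keep <- u[, .N, by = Group][N == 1, Group]  # groups with exactly one distinct Value
sample[keep]                                # keyed join back to all rows of those groups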

The point is that calculating unique values within each group can be time-consuming when there are many groups. Instead, we set the key on the data.table, take the unique (Group, Value) combinations by key (which is extremely fast), count the rows per group, and keep only the groups where that count is 1. Finally we join back to the original table (again very fast). Here's a benchmark on large data:

require(data.table)
set.seed(1L)
sample <- data.table(ID=1:1e7, 
        Group = sample(rep(paste0("id", 1:1e5), each=100)), 
        Value = sample(2, 1e7, replace=TRUE, prob=c(0.9, 0.1)))

system.time (
    ans1 <- sample[, if (length(unique(Value)) == 1) .SD, by = Group]
)
# minimum of three runs
#   user  system elapsed 
# 14.328   0.066  14.382 

system.time ({
    setkey(sample, Group, Value)
    ans2 <- sample[unique(sample, by = key(sample))[, .N, by = Group][N == 1, Group]]
})
# minimum of three runs
#   user  system elapsed 
#  5.661   0.219   5.877 

setkey(ans1, Group, ID)
setkey(ans2, Group, ID)
identical(ans1, ans2) # [1] TRUE

Upvotes: 7

hadley

Reputation: 103948

Here's a solution using dplyr:

library(dplyr)

sample <- data.frame(
  ID = 1:9,
  Group = c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
  Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)

sample %>%
  group_by(Group) %>%
  filter(n_distinct(Value) == 1)

We group the data by Group and then keep only those groups where the number of distinct values of Value is 1.
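
With dplyr >= 1.1.0, the same filter can also be written with the per-operation .by argument, which avoids the persistent grouping and returns an ungrouped result:

library(dplyr)

sample %>%
  filter(n_distinct(Value) == 1, .by = Group)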

Upvotes: 9

John

Reputation: 23768

You can build a logical row selector for sample using ave in many different ways (the last two variants assume Value is numeric).

sample[ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1, ]

or, using abs so that offsetting differences cannot cancel out:

sample[ave(sample$Value, sample$Group, FUN = function(x) sum(abs(x - x[1]))) == 0, ]

or

sample[ave(sample$Value, sample$Group, FUN = function(x) diff(range(x))) == 0, ]
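
A quick sanity check (a sketch) showing that the three selectors agree on the example data:

s1 <- ave(sample$Value, sample$Group, FUN = function(x) length(unique(x))) == 1
s2 <- ave(sample$Value, sample$Group, FUN = function(x) sum(abs(x - x[1]))) == 0
s3 <- ave(sample$Value, sample$Group, FUN = function(x) diff(range(x))) == 0
identical(s1, s2) && identical(s2, s3)
# [1] TRUE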

Upvotes: 6

Jilber Urbina

Reputation: 61214

Here's an approach using aggregate (matching on the aggregated group names works whether Group is a factor or a character vector):

> ind <- aggregate(Value ~ Group, data = sample, FUN = function(x) length(unique(x)) == 1)
> sample[sample$Group %in% ind$Group[ind$Value], ]
  ID Group Value
1  1    AA     1
2  2    AA     1
3  3    AA     1
4  4    BB     2
5  5    BB     2
8  8    BB     2
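
The same idea works with tapply in place of aggregate (a sketch; tapply returns a named logical vector, so we match on its names):

keep <- tapply(sample$Value, sample$Group, FUN = function(x) length(unique(x)) == 1)
sample[sample$Group %in% names(which(keep)), ]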

Upvotes: 2
