Reputation: 617
I have a data frame like below
sample <- data.frame(ID = 1:9,
Group = c('AA','AA','AA','BB','BB','CC','CC','BB','CC'),
Value = c(1,1,1,2,2,2,3,2,3))
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
6 CC 2
7 CC 3
8 BB 2
9 CC 3
I want to select groups according to the number of distinct (unique) values within each group. For example, select groups where all values within the group are the same (one distinct value per group). If you look at the group CC, it has more than one distinct value (2 and 3) and should thus be removed. The other groups, with only one distinct value, should be kept. Desired output:
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
8 BB 2
Would you tell me simple and fast code in R that solves the problem?
Upvotes: 10
Views: 682
Reputation: 93938
data.table
version:
library(data.table)
sample <- as.data.table(sample)
sample[ , if(uniqueN(Value) == 1) .SD, by = Group]
# Group ID Value
#1: AA 1 1
#2: AA 2 1
#3: AA 3 1
#4: BB 4 2
#5: BB 5 2
#6: BB 8 2
An alternative using ave
if the data is numeric, is to check if the variance is 0:
sample[with(sample, ave(Value, Group, FUN=var ))==0,]
An alternative solution that could be faster on large data is:
setkey(sample, Group, Value)
ans <- sample[unique(sample)[, .N, by=Group][N==1, Group]]
The point is that calculating unique
values for each group could be time consuming when there are more groups. Instead, we can set the key on the data.table
, then take unique
values by key (which is extremely fast) and then count the total values for each group. We then require only those where it is 1. We can then perform a join
(which is once again very fast). Here's a benchmark on large data:
require(data.table)
set.seed(1L)
sample <- data.table(ID=1:1e7,
Group = sample(rep(paste0("id", 1:1e5), each=100)),
Value = sample(2, 1e7, replace=TRUE, prob=c(0.9, 0.1)))
system.time (
ans1 <- sample[,if(length(unique(Value))==1) .SD ,by=Group]
)
# minimum of three runs
# user system elapsed
# 14.328 0.066 14.382
system.time ({
setkey(sample, Group, Value)
ans2 <- sample[unique(sample)[, .N, by=Group][N==1, Group]]
})
# minimum of three runs
# user system elapsed
# 5.661 0.219 5.877
setkey(ans1, Group, ID)
setkey(ans2, Group, ID)
identical(ans1, ans2) # [1] TRUE
Upvotes: 7
Reputation: 103948
Here's a solution using dplyr:
library(dplyr)
sample <- data.frame(
ID = 1:9,
Group= c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)
sample %>%
group_by(Group) %>%
filter(n_distinct(Value) == 1)
We group the data by Group
, and then only select groups where the number of distinct values of Value
is 1.
Upvotes: 9
Reputation: 23768
You can make a selector for sample
using ave
many different ways.
sample[ ave( sample$Value, sample$Group, FUN = function(x) length(unique(x)) ) == 1,]
or
sample[ ave( sample$Value, sample$Group, FUN = function(x) sum(x - x[1]) ) == 0,]
or
sample[ ave( sample$Value, sample$Group, FUN = function(x) diff(range(x)) ) == 0,]
Upvotes: 6
Reputation: 61214
Here's an approach
> ind <- aggregate(Value~Group, FUN=function(x) length(unique(x))==1, data=sample)[,2]
> sample[sample[,"Group"] %in% levels(sample[,"Group"])[ind], ]
ID Group Value
1 1 AA 1
2 2 AA 1
3 3 AA 1
4 4 BB 2
5 5 BB 2
8 8 BB 2
Upvotes: 2