Reputation: 153261
I have a large numeric vector - how can I remove the unique values from it efficiently?
To give a simplified example, how can I get from vector a to vector b?
> a = c(1, 2, 3, 3, 2, 4) # 1 and 4 are the unique values
> b = c(2, 3, 3, 2)
Upvotes: 2
Views: 5154
Reputation: 193517
To add to the options already available:
a[duplicated(a) | duplicated(a, fromLast=TRUE)]
# [1] 2 3 3 2
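To see why the one-liner works, here is a minimal sketch that splits it into its two passes. The intermediate names (first_pass, last_pass, keep) are only for illustration and are not part of the original answer.

```r
a <- c(1, 2, 3, 3, 2, 4)

# Forward pass: flags every occurrence of a value except its first one.
first_pass <- duplicated(a)

# Backward pass: flags every occurrence of a value except its last one.
last_pass <- duplicated(a, fromLast = TRUE)

# A value that occurs only once is both its own first and last occurrence,
# so neither pass flags it. Anything flagged by either pass is repeated.
keep <- first_pass | last_pass

b <- a[keep]
print(b)
```

Because both passes are vectorized, the original order of the retained elements is preserved.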
Update: More benchmarks!
Benchmarking Prasanna's answer (fun1) and mine (fun2) against asieira's functions (remove.uniques1 and remove.uniques2, which need the data.table package loaded), we get the following:
fun1 <- function(x) x[x %in% x[duplicated(x)]]
fun2 <- function(x) x[duplicated(x) | duplicated(x, fromLast=TRUE)]
set.seed(1)
a <- ceiling(runif(1000000, min=0, max=100))
library(microbenchmark)
microbenchmark(remove.uniques1(a), remove.uniques2(a),
               fun1(a), fun2(a), times = 20)
# Unit: milliseconds
#                expr       min        lq    median        uq       max neval
#  remove.uniques1(a) 1957.9565 1971.3125 2002.7045 2057.0911 2151.1178    20
#  remove.uniques2(a) 2049.9714 2065.6566 2095.4877 2146.3000 2210.6742    20
#             fun1(a)  213.6129  216.6337  219.2829  297.3085  303.9394    20
#             fun2(a)  154.0829  155.5459  155.9748  158.9121  246.2436    20
I suspect that the number of unique values would also make a difference in terms of the efficiency of these approaches.
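One rough way to check that suspicion is to rerun the two base R functions while varying the number of distinct values. This is only a sketch: the loop variable k, the chosen values of k, and the use of system.time instead of microbenchmark are my assumptions, and timings will differ between machines.

```r
fun1 <- function(x) x[x %in% x[duplicated(x)]]
fun2 <- function(x) x[duplicated(x) | duplicated(x, fromLast = TRUE)]

set.seed(1)
n <- 1000000

# k is the upper bound of the sampled values, so larger k means
# more distinct values (and more values that occur only once).
for (k in c(100, 10000, 1000000)) {
  x <- ceiling(runif(n, min = 0, max = k))

  t1 <- system.time(r1 <- fun1(x))[["elapsed"]]
  t2 <- system.time(r2 <- fun2(x))[["elapsed"]]

  # Both approaches should return exactly the same vector.
  stopifnot(identical(r1, r2))

  cat("k =", format(k, scientific = FALSE),
      "| fun1:", t1, "s | fun2:", t2, "s\n")
}
```

A single system.time call is noisy, so treat the numbers as a rough indication rather than a proper benchmark.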
Upvotes: 4
Reputation: 3683
One vectorized way to do this is to use the built-in table function to find which values only appear once, and then remove them from the vector:
> a = c(1, 2, 3, 3, 2, 4)
> tb.a = table(a)
> appears.once = as.numeric(names(tb.a[tb.a==1]))
> appears.once
[1] 1 4
> b = a[!a %in% appears.once]
> b
[1] 2 3 3 2
Note that table stores the values of the original vector as names, which are character strings, so in your example we need to convert them back to numeric.
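A small sketch of that type change, using the same example vector. The names once_chr and once_num are only illustrative.

```r
a <- c(1, 2, 3, 3, 2, 4)
tb.a <- table(a)

# The counted values live in the names of the table, as character strings.
once_chr <- names(tb.a[tb.a == 1])
print(is.character(once_chr))

# Converting back restores the numeric type for this example.
once_num <- as.numeric(once_chr)
print(once_num)
```

For other input types (for example character or Date vectors), as.numeric would be the wrong conversion, which is why the conversion step has to match the type of the original vector.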
Another way of doing that with data.table:
> library(data.table)
> dt.a = data.table(a=a)
> dt.a[,count:=.N,by=a]
> b = dt.a[count>1]$a
> b
[1] 2 3 3 2
Now let's time them:
remove.uniques1 <- function(x) {
  tb.x = table(x)
  appears.once = as.numeric(names(tb.x[tb.x==1]))
  return(x[!x %in% appears.once])
}
remove.uniques2 <- function(x) {
  dt.x = data.table(data=x)
  dt.x[,count:=.N,by=data]
  return(dt.x[count>1]$data)
}
> a = ceiling(runif(1000000, min=0, max=100))
> system.time( remove.uniques1(a) )
   user  system elapsed
  1.598   0.033   1.658
> system.time( remove.uniques2(a) )
   user  system elapsed
  0.845   0.007   0.855
So both are pretty fast, but the data.table version takes about half the time in this run. In addition, remove.uniques2 preserves whatever type the input vector has. With remove.uniques1, you have to replace the call to as.numeric with whatever conversion fits the type of your original vector.
Upvotes: 1
Reputation: 373
This should give the right answer.
a = c(1, 2, 3, 3, 2, 4)
dups <- duplicated(a)
dup.val <- a[dups]
a[a %in% dup.val]
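The same steps can be wrapped in a small function. This is just a sketch: the name keep_repeated is hypothetical and not from the answer. Since it never converts values to another type, it should also work for non-numeric vectors.

```r
# Keep only the elements whose value occurs more than once.
keep_repeated <- function(x) {
  dup.val <- x[duplicated(x)]  # values seen at least twice
  x[x %in% dup.val]            # every occurrence of those values
}

res_num <- keep_repeated(c(1, 2, 3, 3, 2, 4))
res_chr <- keep_repeated(c("x", "y", "y", "z"))

print(res_num)
print(res_chr)
```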
Upvotes: 1