RNA

Reputation: 153261

How to remove unique values from a vector

I have a large numeric vector - how can I remove the unique values from it efficiently?

To give a simplified example, how can I get from vector a to vector b?

> a = c(1, 2, 3, 3, 2, 4) # 1 and 4 are the unique values
> b = c(2, 3, 3, 2)

Upvotes: 2

Views: 5154

Answers (4)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

To add to the options already available:

a[duplicated(a) | duplicated(a, fromLast=TRUE)]
# [1] 2 3 3 2
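Both calls are needed because `duplicated()` only flags the second and later occurrences of a value, so a scan in a single direction misses one occurrence of each repeated value. A minimal sketch on the question's vector (the names `fwd`, `bwd` and `keep` are only illustrative):

```r
a <- c(1, 2, 3, 3, 2, 4)

# Flags every occurrence of a value except its first one
fwd <- duplicated(a)
# Flags every occurrence of a value except its last one
bwd <- duplicated(a, fromLast = TRUE)

# A value that occurs more than once is flagged by at least one of the scans
keep <- fwd | bwd
a[keep]
```

Values that occur exactly once are flagged by neither scan, which is why they drop out.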

Update: More benchmarks!

Benchmarking Prasanna's answer (fun1) and mine (fun2) against asieira's two functions (remove.uniques1 and remove.uniques2) gives the following:

fun1 <- function(x) x[x %in% x[duplicated(x)]]
fun2 <- function(x) x[duplicated(x) | duplicated(x, fromLast=TRUE)]

set.seed(1)
a <- ceiling(runif(1000000, min=0, max=100))

library(microbenchmark)
microbenchmark(remove.uniques1(a), remove.uniques2(a), 
               fun1(a), fun2(a), times = 20)
# Unit: milliseconds
#                expr       min        lq    median        uq       max neval
#  remove.uniques1(a) 1957.9565 1971.3125 2002.7045 2057.0911 2151.1178    20
#  remove.uniques2(a) 2049.9714 2065.6566 2095.4877 2146.3000 2210.6742    20
#             fun1(a)  213.6129  216.6337  219.2829  297.3085  303.9394    20
#             fun2(a)  154.0829  155.5459  155.9748  158.9121  246.2436    20

I suspect that the number of unique values would also make a difference in terms of the efficiency of these approaches.
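One way to check that suspicion would be to rerun the comparison on inputs with few and with many distinct values. A rough sketch using base `system.time`; the helper `make_input` and the chosen sizes are assumptions for illustration, and no particular timings are claimed:

```r
fun1 <- function(x) x[x %in% x[duplicated(x)]]
fun2 <- function(x) x[duplicated(x) | duplicated(x, fromLast = TRUE)]

# Hypothetical helper: n values drawn from roughly n_distinct possible values
make_input <- function(n, n_distinct) {
  ceiling(runif(n, min = 0, max = n_distinct))
}

set.seed(1)
for (n_distinct in c(100, 10000, 1000000)) {
  x <- make_input(1000000, n_distinct)
  t1 <- system.time(r1 <- fun1(x))[["elapsed"]]
  t2 <- system.time(r2 <- fun2(x))[["elapsed"]]
  # Both approaches should agree before their timings are compared
  stopifnot(identical(r1, r2))
  cat(sprintf("up to %s distinct values: fun1 %.3fs, fun2 %.3fs\n",
              format(n_distinct, scientific = FALSE), t1, t2))
}
```

For more stable numbers the same loop could wrap `microbenchmark` instead of `system.time`.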

Upvotes: 4

asieira

Reputation: 3683

One vectorized way to do this is to use the built-in table function to find which values only appear once, and then remove them from the vector:

> a = c(1, 2, 3, 3, 2, 4)
> tb.a = table(a)
> appears.once = as.numeric(names(tb.a[tb.a==1]))
> appears.once
[1] 1 4
> b = a[!a %in% appears.once]
> b
[1] 2 3 3 2

Note that the table function stores the values of the original vector as its names, which are character strings, so in your example they have to be converted back to numeric.
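The conversion can be seen directly by inspecting the names of the table. A small sketch on the same vector:

```r
a <- c(1, 2, 3, 3, 2, 4)
tb.a <- table(a)

# The distinct values are stored as names, i.e. as character strings
class(names(tb.a))
names(tb.a[tb.a == 1])

# Converting back gives numeric values again
appears.once <- as.numeric(names(tb.a[tb.a == 1]))
class(appears.once)
```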

Another way of doing that with data.table:

> library(data.table)
> dt.a = data.table(a=a)
> dt.a[,count:=.N,by=a]
> b = dt.a[count>1]$a
> b
[1] 2 3 3 2

Now let's time them:

remove.uniques1 <- function(x) {
  tb.x = table(x)
  appears.once = as.numeric(names(tb.x[tb.x==1]))
  return(x[!x %in% appears.once])
}

remove.uniques2 <- function(x) {
  dt.x = data.table(data=x)
  dt.x[,count:=.N,by=data]
  return(dt.x[count>1]$data)
}

> a = ceiling(runif(1000000, min=0, max=100))
> system.time( remove.uniques1(a) )
     user    system   elapsed 
    1.598     0.033     1.658 
> system.time( remove.uniques2(a) )
     user    system   elapsed 
    0.845     0.007     0.855

So both are pretty fast, but the data.table version is clearly faster. remove.uniques2 also preserves whatever type the input vector has, whereas in remove.uniques1 you have to replace the call to as.numeric with whatever conversion fits the type of your original vector.
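If the conversion step is a concern, the counting can also be done in base R without going through character names. This is an alternative sketch, not one of the functions timed above; `remove.uniques3` is a hypothetical name and it has not been benchmarked here:

```r
remove.uniques3 <- function(x) {
  # Position of each element within the distinct values of x
  idx <- match(x, unique(x))
  # Number of occurrences of each distinct value
  counts <- tabulate(idx)
  # Keep elements whose value occurs more than once; the type of x is untouched
  x[counts[idx] > 1]
}

remove.uniques3(c(1, 2, 3, 3, 2, 4))
remove.uniques3(c("a", "b", "b", "c"))
```

Because the result is obtained by subsetting `x` itself, numeric input stays numeric and character input stays character.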

Upvotes: 1

Prasanna Nandakumar

Reputation: 4335

a[a %in% a[duplicated(a)]]
[1] 2 3 3 2

Upvotes: 2

symbiotic

Reputation: 373

This should give the right answer.

a = c(1, 2, 3, 3, 2, 4)
dups <- duplicated(a)
dup.val <- a[dups]
a[a %in% dup.val]
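Printing the intermediate values shows how the steps fit together. A sketch on the same vector (the name `keep` is only illustrative):

```r
a <- c(1, 2, 3, 3, 2, 4)

dups <- duplicated(a)   # TRUE for second and later occurrences only
dup.val <- a[dups]      # the values that occur more than once
keep <- a %in% dup.val  # TRUE for every occurrence of those values

print(dups)
print(dup.val)
print(a[keep])
```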

Upvotes: 1
