Reputation: 3919

Efficient R code for finding indices associated with unique values in vector

Suppose I have a vector vec <- c("D","B","B","C","C").

My objective is to end up with a list of length length(unique(vec)), where element i of the list is a vector of the indices at which unique(vec)[i] occurs in vec.

For example, the list for vec would be:

exampleList <- list()
exampleList[[1]] <- c(1) #Since "D" is the first element
exampleList[[2]] <- c(2,3) #Since "B" is the 2nd/3rd element.
exampleList[[3]] <- c(4,5) #Since "C" is the 4th/5th element.

I tried the following approach, but it's too slow. My real data is large, so I need faster code:

vec <- c("D","B","B","C","C")
uniques <- unique(vec)
exampleList <- lapply(seq_along(uniques), function(i) {
    which(vec==uniques[i])
})
exampleList

Upvotes: 12

Answers (4)

eddi

Reputation: 49448

Update: The expression DT[, list(list(.)), by=.] sometimes gave wrong results in R versions >= 3.1.0. This is now fixed in commit #1280 in the current development version of data.table, v1.9.3. From NEWS:

DT[, list(list(.)), by=.] returns correct results in R >=3.1.0 as well. The bug was due to recent (welcoming) changes in R v3.1.0 where list(.) does not result in a copy. Closes #481.

Using data.table is about 15x faster than tapply:

library(data.table)

vec <- c("D","B","B","C","C")

dt = as.data.table(vec)[, list(list(.I)), by = vec]
dt
#   vec  V1
#1:   D   1
#2:   B 2,3
#3:   C 4,5

# to get it in the desired format
# (perhaps in the future data.table's setnames will work for lists instead)
setattr(dt$V1, 'names', dt$vec)
dt$V1
#$D
#[1] 1
#
#$B
#[1] 2 3
#
#$C
#[1] 4 5

Speed tests:

vec = sample(letters, 1e7, replace = TRUE)

system.time(tapply(seq_along(vec), vec, identity)[unique(vec)])
#   user  system elapsed 
#   7.92    0.35    8.50 

system.time({dt = as.data.table(vec)[, list(list(.I)), by = vec]; setattr(dt$V1, 'names', dt$vec); dt$V1})
#   user  system elapsed 
#   0.39    0.09    0.49

Upvotes: 7

lebatsnok

Reputation: 6479

split(seq_along(vec), vec)

This is faster and shorter than the tapply solution:

vec = sample(letters, 1e7, replace = TRUE)
system.time(res1 <- tapply(seq_along(vec), vec, identity)[unique(vec)])
#   user  system elapsed 
#  1.808   0.364   2.176 
system.time(res2 <- split(seq_along(vec), vec))
#   user  system elapsed 
#  0.876   0.152   1.029
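
One thing to watch: split() groups by factor levels, which are sorted, so the result comes back as B, C, D rather than in order of first appearance. A minimal sketch of how the order asked for in the question could be recovered, using only base R (the names by_level and by_appearance are just illustrative):

```r
vec <- c("D", "B", "B", "C", "C")

# split() converts vec to a factor, so groups follow the sorted levels
by_level <- split(seq_along(vec), vec)
names(by_level)

# Index by unique(vec) to reorder the groups by first appearance in vec
by_appearance <- by_level[unique(vec)]
names(by_appearance)
by_appearance
```

Passing factor(vec, levels = unique(vec)) as the grouping argument should give the same ordering in a single call, if you prefer that to re-indexing afterwards.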

Upvotes: 7

Jeff Keller

Reputation: 781

To keep the original order with josliber's answer, simply index the result by the uniques vector you created:

vec <- c("D","B","B","C","C")

uniques <- unique(vec)

tapply(seq_along(vec), vec, identity)[uniques]

# $D
# [1] 1
#
# $B
# [1] 2 3
#
# $C
# [1] 4 5

Upvotes: 1

josliber

Reputation: 44340

You can do this with tapply:

vec <- c("D", "B", "B", "C", "C")
tapply(seq_along(vec), vec, identity)[unique(vec)]
# $D
# [1] 1
# 
# $B
# [1] 2 3
# 
# $C
# [1] 4 5

The identity function returns its argument unchanged, so each group keeps its vector of indices. Indexing by unique(vec) ensures the groups come back in the order in which the values first appear in your original vector.
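
To see why the indexing step matters, here is a small sketch comparing the result with and without it (the variable names grouped and in_order are only for illustration):

```r
vec <- c("D", "B", "B", "C", "C")

# identity() hands each group's indices back unchanged;
# tapply() orders the groups by the sorted factor levels of vec
grouped <- tapply(seq_along(vec), vec, identity)
names(grouped)

# Indexing by unique(vec) reorders the groups by first appearance
in_order <- grouped[unique(vec)]
names(in_order)
```

Without the index the groups are named B, C, D; with it they follow D, B, C as in vec.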

Upvotes: 5
