Lukas

Reputation: 727

find indices of values within tolerance range in R

say I have vector x

x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4

How do I get the indices of the groups of elements that are 'unique' within the tolerance range (tol), as in the list below? I don't know how many of these groups there are beforehand.

[[1]]
[1] 1 2 3 5

[[2]]
[1] 4 6

[[3]]
[1] 7

thanks

Upvotes: 6

Views: 1017

Answers (4)

989

Reputation: 12937

Here is another attempt, using the cut function from base R. We first create the vector of range breakpoints, named sq, and then collect the elements of x that fall within each range.

sq <- seq(min(x)-tol,max(x)+tol*2,tol*2)
# [1] 0.6 1.4 2.2 3.0

sapply(1:(length(sq)-1), function(i) which(!is.na(cut(x, breaks =c(sq[i], sq[i+1])))))

# [[1]]
# [1] 1 2 3 5

# [[2]]
# [1] 4 6

# [[3]]
# [1] 7

It does not produce any duplicates (no need to use unique, as is the case for @nicola's answer).

It works as follows: in sapply, we first search for elements within the range [0.6, 1.4], then within [1.4, 2.2], and finally within [2.2, 3.0].
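As a minimal base-R illustration of this mechanism: cut returns NA for values that fall outside the supplied breaks, so which(!is.na(...)) recovers the indices landing in a single bin.

```r
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)

# one interval, (0.6, 1.4]: values outside it become NA
bin <- cut(x, breaks = c(0.6, 1.4))
which(!is.na(bin))
# [1] 1 2 3 5
```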

Upvotes: 0

aichao

Reputation: 7445

An alternative using nn2 from RANN to find nearest neighbors within radius for clustering:

library(RANN)

x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2 3 5
##
##[[2]]
##[1] 4 6
##
##[[3]]
##[1] 7

x <- c(1, 1.3, 1.6)
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)  
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2
##
##[[2]]
##[1] 1 2 3
##
##[[3]]
##[1] 2 3

Notes:

  1. The call to nn2 finds, for each element of x, all nearest neighbors among the elements of x within a radius equal to tol. The result nn$nn.idx is a matrix whose rows contain the indices of the nearest neighbors of each element in x. The matrix is dense and padded with zeroes as needed.
  2. Clustering is performed by sorting each row so that unique rows can be extracted. The output m is a matrix where each column contains the indices in a cluster. Again, this matrix is dense and filled with zeroes as needed.
  3. The resulting list is extracted by subsetting each column to remove the zero entries.

This is likely more efficient for large x because nn2 uses a k-d tree, but it suffers from the same issue with elements that overlap (with respect to the tolerance) as pointed out by nicola.

Upvotes: 2

agenis

Reputation: 8377

Maybe it's a hammer to kill a mosquito, but I thought of univariate density clustering: the dbscan library lets you do exactly that:

library(dbscan)
groups <- dbscan(as.matrix(x), eps=tol, minPts=1)$cluster
#### [1] 1 1 1 2 1 2 3

You don't need to know the number of groups in advance.

It gives you the cluster numbers as output, but if you prefer, you can take the group means and round them to the closest integer. Once you have this, you can generate the list, for instance, like this:

split(seq_along(x), groups)
#### $`1`
#### [1] 1 2 3 5
#### ...
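The group-means idea can be sketched in base R; here groups is assumed to be the cluster vector shown above, written out literally rather than recomputed with dbscan:

```r
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
groups <- c(1, 1, 1, 2, 1, 2, 3)  # cluster vector as dbscan returned it above

# mean of each cluster, rounded to the closest integer
round(tapply(x, groups, mean))
# gives 1, 2, 3 (a vector named by cluster number)
```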

Edit: Behaviour with overlapping: this algorithm assigns the same group to all elements that are within the tolerance range of one another (it works by proximity), so you might end up with fewer groups than expected if the ranges overlap.

Upvotes: 1

nicola

Reputation: 24490

Not 100% reliable, since it uses unique on lists, but you can try:

unique(apply(outer(x,x,function(a,b) abs(a-b)<tol),1,which))
#[[1]]
#[1] 1 2 3 5
#
#[[2]]
#[1] 4 6
#
#[[3]]
#[1] 7

The point @Roland raised in the comments shows that there is some ambiguity in your requirements. For instance, if x <- c(1, 1.3, 1.6), my line gives three groups: 1-2, 2-3 and 1-2-3. This is because, from 1's point of view, it is similar only to 1.3, but from 1.3's point of view, it is similar to both 1 and 1.6.
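That overlapping case can be reproduced directly with the same one-liner (base R only):

```r
x <- c(1, 1.3, 1.6)
tol <- 0.4

# row i of the outer matrix flags the elements within tol of x[i];
# unique() then drops only exact duplicate rows, so overlapping
# neighborhoods survive as separate groups
unique(apply(outer(x, x, function(a, b) abs(a - b) < tol), 1, which))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1 2 3
#
# [[3]]
# [1] 2 3
```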

Upvotes: 2
