Reputation: 727
Say I have a vector x:
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4
How do I get the indices of the groups of elements that are 'unique' within the tolerance range (tol), as in the list below? I don't know beforehand how many of these groups there are.
[[1]]
[1] 1 2 3 5
[[2]]
[1] 4 6
[[3]]
[1] 7
thanks
Upvotes: 6
Views: 1017
Reputation: 12937
Here is another attempt, using the cut function from base R. We first create the vector of range boundaries, named sq, and then collect the elements of x that fall within each specific range.
sq <- seq(min(x)-tol,max(x)+tol*2,tol*2)
# [1] 0.6 1.4 2.2 3.0
sapply(1:(length(sq)-1), function(i) which(!is.na(cut(x, breaks = c(sq[i], sq[i+1])))))
# [[1]]
# [1] 1 2 3 5
# [[2]]
# [1] 4 6
# [[3]]
# [1] 7
It does not produce any duplicates (no need to use unique, as is the case for @nicola's answer). It works as follows: in sapply, we first search for elements within the range [0.6, 1.4], then [1.4, 2.2], and finally [2.2, 3.0].
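The same grouping can be written more compactly with split, which collects the indices per interval in one step. This is a sketch of the same idea, not the answer's original code:

```r
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4
sq <- seq(min(x) - tol, max(x) + tol * 2, tol * 2)
# cut() labels each element with its interval; split() then collects
# the indices per interval (drop = TRUE would discard empty intervals).
unname(split(seq_along(x), cut(x, breaks = sq)))
# [[1]]
# [1] 1 2 3 5
#
# [[2]]
# [1] 4 6
#
# [[3]]
# [1] 7
```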
Upvotes: 0
Reputation: 7445
An alternative using nn2 from RANN to find nearest neighbors within a radius, for clustering:
library(RANN)
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2 3 5
##
##[[2]]
##[1] 4 6
##
##[[3]]
##[1] 7
With overlapping elements:
x <- c(1, 1.3, 1.6)
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2
##
##[[2]]
##[1] 1 2 3
##
##[[3]]
##[1] 2 3
Notes:
- nn2 finds all nearest neighbors for each element of x, with respect to all elements of x, within a radius equal to tol. The result nn$nn.idx is a matrix whose rows contain the indices of the nearest neighbors of each element in x. The matrix is dense and padded with zeroes as needed.
- m is a matrix in which each column contains the indices belonging to one cluster. Again, this matrix is dense and padded with zeroes as needed.
- This approach is likely more efficient for large x, because nn2 uses a k-d tree, but it suffers from the same issue for elements that overlap (with respect to the tolerance) as pointed out by nicola.
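The zero padding can be seen by inspecting a single row of nn$nn.idx. A small sketch, assuming the RANN package is installed:

```r
library(RANN)  # assumes the RANN package is installed
x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4
nn <- nn2(x, x, k = length(x), searchtype = "radius", radius = tol)
# Row 1 holds the neighbors of x[1] = 1: indices 1, 2, 5 (value 1) and
# 3 (value 1.1) are within tol; the remaining slots are zero-padded.
sort(nn$nn.idx[1, nn$nn.idx[1, ] > 0])
# [1] 1 2 3 5
```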
Upvotes: 2
Reputation: 8377
Maybe it's a hammer to kill a mosquito, but I thought of univariate density clustering: the dbscan library enables you to do exactly that:
library(dbscan)
groups <- dbscan(as.matrix(x), eps=tol, minPts=1)$cluster
#### [1] 1 1 1 2 1 2 3
You don't need to know the number of groups in advance. The output gives you the cluster number for each element, but if you prefer you can take the group means and round them to the closest integer. Once you have the cluster assignments, you can generate the list, for instance, like this:
split(seq_along(x), groups)
#### $`1`
#### [1] 1 2 3 5
#### ...
Edit: Behaviour with overlapping: this algorithm assigns the same group to all elements that are within the tolerance range of one another (it works by proximity), so you might end up with fewer groups than expected if there is overlap.
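This merging behaviour can be checked directly on the overlapping example discussed elsewhere in this thread, assuming the dbscan package is installed:

```r
library(dbscan)  # assumes the dbscan package is installed
tol <- 0.4
x <- c(1, 1.3, 1.6)
# 1 and 1.6 differ by 0.6 > tol, but each is within tol of 1.3,
# so density-reachability chains all three into a single cluster.
dbscan(as.matrix(x), eps = tol, minPts = 1)$cluster
# [1] 1 1 1
```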
Upvotes: 1
Reputation: 24490
Not 100% reliable, since it uses unique on lists, but you can try:
unique(apply(outer(x,x,function(a,b) abs(a-b)<tol),1,which))
#[[1]]
#[1] 1 2 3 5
#
#[[2]]
#[1] 4 6
#
#[[3]]
#[1] 7
The point @Roland raised in the comments shows that there is some ambiguity in your requirements. For instance, if x <- c(1, 1.3, 1.6), my line gives three groups: 1-2, 2-3 and 1-2-3. This is because, from the point of view of 1, it is similar only to 1.3, but from the point of view of 1.3, it is similar to both 1 and 1.6.
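A quick check of the three overlapping groups described above, using the same one-liner on the example vector:

```r
x <- c(1, 1.3, 1.6)
tol <- 0.4
# Each row of the outer() matrix says which elements are within tol
# of that element; which() turns each row into a group of indices.
unique(apply(outer(x, x, function(a, b) abs(a - b) < tol), 1, which))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1 2 3
#
# [[3]]
# [1] 2 3
```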
Upvotes: 2