Reputation: 23
I have a vector of positions (basepairs in a genome for those interested) and I am trying to identify clusters of positions within that vector that fall beneath a given distance threshold.
So using a simple example, I want to identify positions in this vector where the distance between values is less than 20
bp = c(1, 20, 30, 100, 400, 410, 430, 500, 590, 690)
So far I have done this like so:
d <- 20
sapply(1:length(bp), function(z){
(bp[z + 1] - bp[z]) <= d
})
Where d is the distance threshold. This gives me a logical vector like so:
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE NA
What I would like is to identify these consecutive values of 'TRUE' as clusters, and name all other positions separately. Basically I would like to end with a data.frame like below:
bp cluster
1 1 a
2 20 a
3 30 a
4 100 b
5 400 c
6 410 c
7 430 c
8 500 d
9 590 e
10 690 f
Many thanks in advance for any suggestions.
Upvotes: 1
Views: 1835
Reputation: 92282
Here's a vectorized attempt, but you may be out of letters if there are too many groups so you may just stick with numeric values instead (remove the letters
part)
data.frame(bp, cluster = letters[cumsum(c(1L, diff(bp) > 20L))])
# bp cluster
# 1 1 a
# 2 20 a
# 3 30 a
# 4 100 b
# 5 400 c
# 6 410 c
# 7 430 c
# 8 500 d
# 9 590 e
# 10 690 f
Upvotes: 2
Reputation: 1127
Use number group instead of a-h group. However you can convert it to that way.
bp = c(1, 20, 30, 100, 400, 410, 430, 500, 590, 690)
f <- sapply(1:length(bp), function(z){ as.integer((bp[z] + d - 1) / d) })
data.frame(bp,cluster = f)
Upvotes: 0
Reputation: 11762
Kind of a poor mans approach is a for loop...
a <- diff(bp) < 20
b <- 1
d <- c()
for(l in a) {
if(l) {
d <- c(d, b)
} else{
b <- b + 1
d <- c(d, b)
}
}
Upvotes: 0