Reputation: 23

R - how to identify consecutive rows satisfying a condition in a data frame?

I have a vector of positions (basepairs in a genome for those interested) and I am trying to identify clusters of positions within that vector that fall beneath a given distance threshold.

So using a simple example, I want to identify positions in this vector where the distance between values is less than 20

bp = c(1, 20, 30, 100, 400, 410, 430, 500, 590, 690)

So far I have done this like so:

d <- 20
sapply(1:length(bp), function(z){ 
  (bp[z + 1] - bp[z]) <= d
})

Where d is the distance threshold. This gives me a logical vector like so:

 [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE    NA

What I would like is to identify these consecutive values of 'TRUE' as clusters, and name all other positions separately. Basically I would like to end with a data.frame like below:

    bp cluster
1    1       a
2   20       a
3   30       a
4  100       b
5  400       c
6  410       c
7  430       c
8  500       d
9  590       e
10 690       f

Many thanks in advance for any suggestions.

Upvotes: 1

Answers (3)

David Arenburg

Reputation: 92300

Here's a vectorized attempt, but you may be out of letters if there are too many groups so you may just stick with numeric values instead (remove the letters part)

data.frame(bp, cluster = letters[cumsum(c(1L, diff(bp) > 20L))])
#     bp cluster
# 1    1       a
# 2   20       a
# 3   30       a
# 4  100       b
# 5  400       c
# 6  410       c
# 7  430       c
# 8  500       d
# 9  590       e
# 10 690       f

Upvotes: 2

Yong

Reputation: 1127

Use number group instead of a-h group. However you can convert it to that way.

    bp = c(1, 20, 30, 100, 400, 410, 430, 500, 590, 690)
    f <- sapply(1:length(bp), function(z){ as.integer((bp[z] + d - 1) / d) })
    data.frame(bp,cluster = f)

Upvotes: 0

drmariod

Reputation: 11782

Kind of a poor mans approach is a for loop...

a <- diff(bp) < 20
b <- 1
d <- c()
for(l in a) {
  if(l) {
    d <- c(d, b)
  } else{
    b <- b + 1
    d <- c(d, b)
  }
}

Upvotes: 0

R - how to identify consecutive rows satisfying a condition in a data frame?

Answers (3)

Related Questions