Fenrir

Reputation: 85

How do I find ranges of successive numbers in a vector in R

I have a vector in R:

data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)

What I want is to find the start and end indices of each stretch of more than 3 successive values, i.e.:

start end
3     6   (stretch 6-9)
8     13  (stretch 30-35)

I have no clue how to get there.

Upvotes: 2

Views: 787

Answers (3)

Frank

Reputation: 66819

From @eddi's answer to my similar question...

runs = split(seq_along(data), cumsum(c(0, diff(data) > 1)))
lapply(runs[lengths(runs) > 1], range)

# $`2`
# [1] 3 6
# 
# $`4`
# [1]  8 13

How it works:

  • seq_along(data) are the indices of data, from 1..length(data)
  • c(0, diff(data) > 1) has a 1 at each index where data "jumps"
  • cumsum(c(0, diff(data) > 1)) is an identifier for consecutive runs between jumps

So runs is a division of data's indices into runs where data's values are consecutive.
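To make the steps above concrete, here is a minimal sketch that prints the intermediate vectors and then wraps the same idea in a function. The function name find_runs and its min_len parameter are hypothetical, not part of the original answer. Using diff(x) != 1 instead of diff(x) > 1 is also my own variation, so that a repeated or decreasing value breaks a run too.

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)

# 1 wherever the value jumps by more than 1 from the previous element
jumps <- c(0, diff(data) > 1)
# Running count of jumps: constant within a run of consecutive values
run_id <- cumsum(jumps)
run_id
# [1] 0 1 2 2 2 2 3 4 4 4 4 4 4 5

# Hypothetical helper: min_len is the minimum number of values in a run.
# The question asks for "longer than 3", hence the default of 4.
find_runs <- function(x, min_len = 4) {
  runs <- split(seq_along(x), cumsum(c(0, diff(x) != 1)))
  keep <- runs[lengths(runs) >= min_len]
  data.frame(start = unname(vapply(keep, min, integer(1))),
             end   = unname(vapply(keep, max, integer(1))))
}

res <- find_runs(data)
res
#   start end
# 1     3   6
# 2     8  13
```

Note that the original lengths(runs) > 1 filter keeps any run of two or more values. It gives the same result on this data only because there are no runs of length 2 or 3.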

Upvotes: 5

C8H10N4O2

Reputation: 19005

Here's a base R solution relying heavily on ?diff:

data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)

diff1 <- diff(data[1:(length(data)-1)]) # lag 1 difference (last element dropped so the length matches diff2)
diff2 <- diff(data, 2) # lag 2 difference

# indices of starting consecutive stretches -- these will overlap
start_index <- which(diff1==1 & diff2==2)
end_index <- start_index + 2

# notice that these overlap:
data.frame(start_index, end_index)

# To remove overlap:
# We can remove *subsequent* consecutive start indices
#           and *initial* consecutive end indices

start_index_new <- start_index[which(c(0, diff(start_index))!=1)]
end_index_new <- end_index[which(c(diff(end_index), 0) != 1)]
data.frame(start_index_new, end_index_new)

#   start_index_new end_index_new
# 1               3             6
# 2               8            13

Cory's answer is great, but this one might be a little easier to understand: for each position i, you check whether position i+1 holds a value 1 greater and position i+2 holds a value 2 greater. You build ranges from those matches and then consolidate the overlapping ranges with another diff call. Note that this finds stretches of 3 or more values.

There are also packages, such as zoo, that can help you get rolling differences.
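The window check described in this answer can be extended to the "longer than 3" requirement from the question. The sketch below is my own variation, not the answer author's code. It assumes data is strictly increasing integers, in which case a single lagged diff is enough to test a whole window. The parameter k and the names win_start, win_end, run_start and run_end are hypothetical.

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)

k <- 4  # hypothetical parameter: minimum number of consecutive values

# Assumes strictly increasing integers: the k values starting at i are
# consecutive exactly when data[i + k - 1] - data[i] == k - 1.
win_start <- which(diff(data, lag = k - 1) == k - 1)
win_end   <- win_start + k - 1

# Merge overlapping windows: keep a start unless the previous window starts
# one position earlier, and keep an end unless the next window ends one
# position later.
run_start <- win_start[c(0, diff(win_start)) != 1]
run_end   <- win_end[c(diff(win_end), 0) != 1]

data.frame(start = run_start, end = run_end)
#   start end
# 1     3   6
# 2     8  13
```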

Upvotes: 0

cory

Reputation: 6659

So, first take the diff of data and build a run-length sequence from it. Then the starting points are the indices just before the 2s in that sequence, and the ending points are where the sequence drops (the negative differences)... it's hard to explain, so just step through the code and check it out. This does not find sequences of two, like (3, 4) in (1, 3, 4, 7, 9). I had to include the remove step for sequences whose values differ by two, like (1, 3, 5, 7), which weren't handled correctly otherwise. Anyhow, fun exercise. I hope somebody can do better. This is a bit of a mess...

data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)
a <- sequence(rle(diff(data))$lengths)
starts <- which(a==2) - 1
ends <- which(diff(a)<0) + 1
remove <- starts[starts %in% (ends-2)]
starts <- starts[!starts %in% remove]
ends <- ends[!ends %in% (remove+2)]
if(length(ends) < length(starts)) ends <- c(ends, length(data))
starts
# [1] 3 8
ends
# [1]  6 13
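One possible way to avoid the remove step is to run rle on the logical vector diff(data) == 1 rather than on the raw differences, so that only steps of exactly 1 count as a run. This is a sketch of a different technique, not the code above. The names min_steps, run_end, run_start and res are my own.

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)

# Run-length encode "is this step exactly +1?"
r <- rle(diff(data) == 1)

# A run of m steps ending at diff position p covers data[(p - m + 1):(p + 1)]
run_end   <- cumsum(r$lengths) + 1
run_start <- run_end - r$lengths

# min_steps is a hypothetical threshold: 3 unit steps = 4 consecutive values
min_steps <- 3
keep <- r$values & r$lengths >= min_steps

res <- data.frame(start = run_start[keep], end = run_end[keep])
res
#   start end
# 1     3   6
# 2     8  13
```

Because constant differences other than 1, such as (1, 3, 5, 7), are encoded as FALSE, they are never reported as runs.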

Upvotes: 0
