Reputation: 85
I have a vector in R:
data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)
What I want is to find the start and end indices of every stretch of more than 3 successive values, i.e.:
start end
3     6   (stretch 6-9)
8     13  (stretch 30-35)
I have no clue how to get there.
Upvotes: 2
Views: 787
Reputation: 66819
From @eddi's answer to my similar question...
runs = split(seq_along(data), cumsum(c(0, diff(data) > 1)))
lapply(runs[lengths(runs) > 1], range)
# $`2`
# [1] 3 6
#
# $`4`
# [1] 8 13
How it works:
seq_along(data) are the indices of data, from 1 to length(data).
c(0, diff(data) > 1) has a 1 at each index where data "jumps".
cumsum(c(0, diff(data) > 1)) is an identifier for consecutive runs between jumps.
So runs is a division of data's indices into runs where data's values are consecutive.
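To make those steps concrete, here is a minimal sketch that names each intermediate result and then keeps only the longer runs. The names jumps, run_id, long_runs, result and min_len are mine, not from the answer, and I am assuming that "longer than 3" means at least 4 values and that data is strictly increasing.

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)

# 1 wherever a value is more than one above its predecessor (a "jump")
jumps <- c(0, diff(data) > 1)
# running count of jumps: every consecutive run gets its own id
run_id <- cumsum(jumps)
# group the indices of data by run id
runs <- split(seq_along(data), run_id)

# assumption: "longer than 3 successive values" means at least 4 values
min_len <- 4
long_runs <- runs[lengths(runs) >= min_len]

# one row per run: first and last index
result <- do.call(rbind, lapply(long_runs, range))
colnames(result) <- c("start", "end")
result
#   start end
# 2     3   6
# 4     8  13
```

The row names are just the run ids from cumsum. If no run is long enough, do.call(rbind, ...) returns NULL and the colnames line fails, so real code would probably want a guard there.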
Upvotes: 5
Reputation: 19005
Here's a base R solution relying heavily on ?diff:
data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)
diff1 <- diff(data[1:(length(data)-1)]) # lag 1 difference
diff2 <- diff(data, 2) # lag 2 difference
# indices of starting consecutive stretches -- these will overlap
start_index <- which(diff1==1 & diff2==2)
end_index <- start_index + 2
# notice that these overlap:
data.frame(start_index, end_index)
# To remove overlap:
# We can remove *subsequent* consecutive start indices
# and *initial* consecutive end indices
start_index_new <- start_index[which(c(0, diff(start_index))!=1)]
end_index_new <- end_index[which(c(diff(end_index), 0) != 1)]
data.frame(start_index_new, end_index_new)
# start_index_new end_index_new
# 1 3 6
# 2 8 13
Cory's answer is great. This one might just be a little easier to understand, because you're basically checking for cases where, from position i, position i+1 has a value of 1 more and position i+2 has a value of 2 more. You build ranges off of this and then consolidate your ranges with another diff call. To my thinking this is a bit simpler.
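The same idea can be wrapped in a small function, sketched below. find_stretches and its min_len argument are hypothetical names I made up for illustration. The triple check only guarantees stretches of 3 or more values, so the final filter is my addition, for when you want strictly longer stretches as in the question.

```r
# find_stretches() is a hypothetical helper; min_len is an assumed parameter
find_stretches <- function(x, min_len = 3) {
  d1 <- diff(head(x, -1))   # x[i+1] - x[i], for i up to length(x) - 2
  d2 <- diff(x, lag = 2)    # x[i+2] - x[i]
  # every i that opens a 3-value consecutive stretch (these overlap)
  start <- which(d1 == 1 & d2 == 2)
  end <- start + 2
  # consolidate: keep a start unless it directly follows another start,
  # and keep an end unless it is directly followed by another end
  start <- start[c(0, diff(start)) != 1]
  end <- end[c(diff(end), 0) != 1]
  out <- data.frame(start, end)
  # drop stretches shorter than min_len values
  out[out$end - out$start + 1 >= min_len, , drop = FALSE]
}

data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)
res <- find_stretches(data, min_len = 4)
res
#   start end
# 1     3   6
# 2     8  13
```

With min_len = 4, a vector like c(1, 2, 3, 10, 11, 12, 13) would report only the stretch at indices 4 to 7, since 1 to 3 is just three values long.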
There are also packages, like zoo, that can help you get rolling differences.
Upvotes: 0
Reputation: 6659
So, first take the diff of data and do a run-length sequence on it. Then the starting points are the indices just before the 2s, and the ending points are where that sequence drops (its negative differences). It's hard to explain; just step through the code and check it out. This does not find sequences of two, like (3, 4) in (1, 3, 4, 7, 9). I had to include the remove part for sequences that were off by two, like (1, 3, 5, 7); those weren't caught correctly. Anyhow, fun exercise. I hope somebody can do better. This is a bit of a mess...
data <- c(1,4,6,7,8,9,20,30,31,32,33,34,35,60)
a <- sequence(rle(diff(data))$lengths)
starts <- which(a==2) - 1
ends <- which(diff(a)<0) + 1
remove <- starts[starts %in% (ends-2)]
starts <- starts[!starts %in% remove]
ends <- ends[!ends %in% (remove+2)]
if(length(ends) < length(starts)) ends <- c(ends, length(data))
starts
# [1] 3 8
ends
# [1] 6 13
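For comparison, here is a sketch of a variant that runs rle on the logical vector diff(data) == 1 instead of on the raw differences, so equal steps of size 2 such as (1, 3, 5, 7) are never counted and no remove step is needed. This is not the code above, just one possible tidy-up. The names r, run_start, run_end and keep are mine, and I am assuming a stretch must have at least 4 values (3 unit steps).

```r
data <- c(1, 4, 6, 7, 8, 9, 20, 30, 31, 32, 33, 34, 35, 60)

# runs of TRUE mark stretches where each value is exactly 1 above the previous
r <- rle(diff(data) == 1)
run_end <- cumsum(r$lengths)           # position of the last diff in each run
run_start <- run_end - r$lengths + 1   # position of the first diff in each run

# assumption: at least 4 successive values, i.e. at least 3 unit steps
keep <- r$values & r$lengths >= 3
starts <- run_start[keep]
ends <- run_end[keep] + 1              # a run of k diffs spans k + 1 values

starts
# [1] 3 8
ends
# [1] 6 13
```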
Upvotes: 0