In R what is an efficient way of finding the indices of the start and finish of sequences of 1 or more numbers that increase by 1

Question

I have a vector of numbers:
SampleVector <- c(2,4,7,8,9,12,14,16,17,19,23,24,25,26,27,29)
I want to find the indices of elements at the start and finish of sequences that increase by 1, but I also want the indices of elements that are not part of a sequence.
Another way of saying the same thing: I want the indices of all elements that are not inside single-step sequences.
For the SampleVector, the indices I want are:
DesiredIndices <- c(1,2,3,5,6,7,8,9,10,11,15,16)
That is, everything except the number 8 (as it is inside the 7:9 sequence) and the numbers 24, 25 and 26 (as they are inside the 23:27 sequence).
My best attempt so far is:

SequenceStartAndEndIndices <- function(vector){
  DifferenceVector          <- diff(vector)
  DiffRunLength             <- rle(DifferenceVector)
  IndicesOfSingleElements   <- which(DifferenceVector > 1) + 1
  IndicesOfEndOfSequences   <- cumsum(DiffRunLength$lengths)[which((DiffRunLength$lengths * DiffRunLength$values) == DiffRunLength$lengths)] + 1
  IndicesOfStartsOfSequences<- c(1,head(IndicesOfEndOfSequences+1,-1))
  UniqueIndices             <- unique(c(IndicesOfStartsOfSequences,IndicesOfEndOfSequences,IndicesOfSingleElements))
  SortedIndices             <- sort(UniqueIndices)
  return(SortedIndices)
}

This function gives me the correct answer:

> SequenceStartAndEndIndices(vector = SampleVector)
 [1]  1  2  3  5  6  7  8  9 10 11 15 16

...but it is almost impossible to follow, and it is not obvious how well it generalises to other inputs. Is there a better way, or perhaps an existing function in a package somewhere?

As background, the purpose of this is to help parse a long vector of distance markers into something that is reasonably human readable, e.g. instead of "at kilometres: 1,8,9,10,11,13" I'll be able to provide "at kilometres: 1, 8 to 11 and 13".

Ronak Shah · Accepted Answer

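You can use tapply in base R to split the vector into groups of consecutive numbers. The grouping key cumsum(c(TRUE, diff(SampleVector) > 1)) starts a new group wherever the gap to the previous element is larger than 1; each group is then collapsed to either its single value or a "first to last" string.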

SampleVector <- c(2,4,7,8,9,12,14,16,17,19,23,24,25,26,27,29)

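toString(tapply(SampleVector,
                # group id: increments wherever the gap to the previous element exceeds 1
                cumsum(c(TRUE, diff(SampleVector) > 1)),
                function(x) {
                  # single-element group: keep the value; otherwise "first to last"
                  if (length(x) == 1) x else paste(x[1], x[length(x)], sep = ' to ')
                }))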

#[1] "2, 4, 7 to 9, 12, 14, 16 to 17, 19, 23 to 27, 29"
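
The snippet above produces the readable string directly. If the indices themselves are still wanted, as in DesiredIndices from the question, one possible sketch is to drop only the elements that have a step of exactly 1 on both sides. RunBoundaryIndices is a hypothetical helper name, not a function from any package, and the sketch assumes the vector is sorted in increasing order.

```r
SampleVector <- c(2,4,7,8,9,12,14,16,17,19,23,24,25,26,27,29)

# Hypothetical helper (not from any package). Assumes x is sorted increasingly.
RunBoundaryIndices <- function(x) {
  if (length(x) == 0) return(integer(0))
  StepIsOne  <- diff(x) == 1
  FollowsRun <- c(FALSE, StepIsOne)   # TRUE where the previous element is x - 1
  LeadsRun   <- c(StepIsOne, FALSE)   # TRUE where the next element is x + 1
  # An element is interior to a run only if both hold; keep everything else.
  which(!(FollowsRun & LeadsRun))
}

RunBoundaryIndices(SampleVector)
#  [1]  1  2  3  5  6  7  8  9 10 11 15 16
```

Elements that are not part of any sequence are kept automatically, because neither neighbour is one step away, so no separate handling of single elements is needed.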
