Tim

Reputation: 7464

Counting lengths of subseries

Imagine a series of numbers like

c(21,22,23,30,31,32,34,35,36,37,38,50,NA,52)

where subseries are defined as follows: x[t] belongs to the same subseries as x[t-1] if x[t] = x[t-1] + 1.

So in the example above we have the following subseries:

c(21,22,23,30,31,32,34,35,36,37,38,50,NA,52)
## 1  1  1  2  2  2  3  3  3  3  3  4  -  5    # series ID
##    3    |   3    |      5      | 1 | | 1    # length

What would be the most efficient way of tagging the subseries and counting their lengths (as a single function or two separate ones)?

Upvotes: 1

Views: 63

Answers (2)

Tim

Reputation: 7464

I'm accepting the answer by akrun (with a contribution by David Arenburg), but for reference, here is an Rcpp solution I wrote in the meantime. It returns, for every element, the length of the subseries that element belongs to.

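#include <Rcpp.h>
using namespace Rcpp;

// For each element of x, return the length of the subseries it belongs to.
// [[Rcpp::export]]
NumericVector cpp_seriesLengths(NumericVector x) {
  int n = x.length();
  if (n == 1)
    return wrap(1);
  NumericVector out(n);
  int tmpCount = 1;   // length of the subseries currently being scanned
  int prevStart = 0;  // index where that subseries starts

  for (int i = 0; i < (n-1); i++) {
    if ( x[i] == (x[i+1] - 1) ) {
      // next element continues the current subseries
      tmpCount += 1;
    } else {
      // subseries ends at i: write its length to all of its elements
      for (int j = prevStart; j <= i; j++)
        out[j] = tmpCount;
      tmpCount = 1;
      prevStart = i+1;
    }
  }
  // write out the last subseries
  for (int j = prevStart; j < n; j++)
    out[j] = tmpCount;

  return out;
}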
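
The question also asks for tagging the subseries, which the function above does not do. A minimal sketch of that counterpart is below. It is written as plain C++ on std::vector so it compiles without Rcpp, and it makes two assumptions that are not from the thread: the name seriesIds is hypothetical, and NaN stands in for R's NA and gets ID 0 (the "-" in the question's example).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): tag each element with the ID of
// the subseries it belongs to. Assumption: NaN plays the role of NA and is
// tagged 0, so real subseries are numbered from 1.
std::vector<int> seriesIds(const std::vector<double>& x) {
    std::vector<int> ids(x.size(), 0);
    int currentId = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (std::isnan(x[i])) {
            continue;  // missing value: keep ID 0
        }
        // The element continues the current subseries only if there is a
        // previous element, it is not missing, and it is exactly one less.
        bool continues = i > 0 && !std::isnan(x[i - 1]) && x[i] == x[i - 1] + 1;
        if (!continues) {
            ++currentId;  // a new subseries starts here
        }
        ids[i] = currentId;
    }
    return ids;
}
```

For the vector in the question (with NaN in place of NA) this should give 1 1 1 2 2 2 3 3 3 3 3 4 0 5. Porting it to Rcpp would mostly mean swapping the vector types and the missing-value check.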

Upvotes: 1

akrun

Reputation: 887601

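With v1 holding the vector from the question, we can take the differences between adjacent elements (after replacing NA with 0), flag every position where the difference is not 1 (a new subseries starts there), take the cumulative sum of those flags to get a group ID, and use that as the grouping variable to count the elements in each subseries: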

unname(tapply(v1, cumsum(c(TRUE, diff(replace(v1, is.na(v1), 0))!=1)), length))
#[1] 3 3 5 1 1 1

If the NA elements should be reported as "" instead of a length:

unname(tapply(v1, cumsum(c(TRUE, diff(replace(v1, is.na(v1), 0))!=1)), 
            function(x) if(all(is.na(x))) "" else length(x)))
#[1] "3" "3" "5" "1" ""  "1"

Or a variation using rle, posted by @DavidArenburg:

rle(cumsum(c(TRUE, diff(replace(v1, is.na(v1), 0))!=1)))$lengths
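
One thing to check with the expressions above: replacing NA with 0 assumes the data never has a 1 right after an NA or a -1 right before one, because the step to or from the substituted 0 would then look like a continuation. For comparison, here is a rough plain C++ sketch of the same run-length idea that avoids the substitution. The name runLengths is hypothetical, and NaN again stands in for NA.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): one length per subseries, in
// order of appearance. Assumption: NaN plays the role of NA and always forms
// its own run of length 1.
std::vector<int> runLengths(const std::vector<double>& x) {
    std::vector<int> lengths;
    for (std::size_t i = 0; i < x.size(); ++i) {
        // Any comparison involving NaN is false, so a NaN never extends a run
        // and the element after a NaN always starts a new one.
        bool continues = i > 0 && x[i] == x[i - 1] + 1;
        if (continues) {
            ++lengths.back();      // extend the current run
        } else {
            lengths.push_back(1);  // start a new run
        }
    }
    return lengths;
}
```

On the question's vector this should give 3 3 5 1 1 1, the same as the tapply and rle versions, while an input like (NaN, 1) stays as two runs of length 1.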

Upvotes: 3
