Identify start and end time of a value per id in a data frame

Question

This relates to my previous question on identifying the occurrence of a value in a data frame per id. This time I am trying to identify consecutive measurements per id with a length of 4 or more.

Ex.

Below is an example of w occurring consecutively with a run length of 4:

 id t1 t2 t3 t4 t5 t6
 1  s  s  w  w  w  w

For the same id, an example of 4 consecutive occurrences of w followed by 4 non-w occurrences after the last w:

  id   t3 t4 t5 t6 t7 t8 t9 t10
   1    w  w  w  w  r  s  s  s

I would like to save this in a df as:

 id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
  1  s  s  w        w  r  s  s  s

The format of my dataset with and without consecutive w occurrences:

    id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
    1  s  s  w  w  w  w  r  s  s  s #after t2 value there are 4 occ. of w and after t6 (last one) there are 4 non-w occ.      
    2  s  w  w  w  e  w  w  s  t  v  #no 4 consecutive w occurrence and no 4 non-w occurrence after t7
    3  w  w  w  w  w  d  s  s  s  r #5 occ. of w after t5
    4  e  w  w  w  w  w  w  w  w  w #9 occ. of w after t1
    5  w  e  w  w  w  w  r  r  r  r #4 occ. of w after t2 and 4 occ. of non-w after t6
    6  w  s  w  r  w  r  w  w  s  w #no 4 consecutive w occurrence

Output:

id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
 1  s  s  w        w  r  s  s   s
 3              w  d  s  s  s   r
 4  e  w
 5                 w  r  r  r   r

How can I split this format into two data frames, one with the start values and one with the end values?

Ex.

df1:

  id t1 t2 t3 
   1  s  s  w

as well as the other start cases for id2, id3...

df2:

id   t6 t7 t8 t9 t10
1     w  r  s  s  s

as well as the other end cases for id2, id3...

Sample data:

df<-structure(list(id=c(1,2,3,4,5,6), t1=c("s","s","w","e","w","w"), t2=c("s","w","w","w","e","s"),t3 = c("w","w","w","w","w","w"),
                        t4 = c("w","w","w","w","w","r"), t5 = c("w","e","w","w","w","w"), t6 = c("w","w","d","w","w","r"),
                       t7= c("r","w","s","w","r","w"), t8 = c("s","s","s","w","r","w"), t9=c("s","t","s","w","r","s"), t10=c("s","v","r","w","r","w")), row.names = c(NA, 6L), class = "data.frame")

Code that identifies the start and end points based on w:

Start (Not working for consecutive time steps):

df1 <- df
df1[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) > match('w', x), '')))
df1<-df1[rowSums(df1 == 'w')!=0,  ,drop = FALSE]

End (Not working for consecutive time steps):

df2 <- df
df2[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) <= match('w', x), '')))

df2 <- df2[c(TRUE, colSums(df2[-2] != '') > 0)]
df2<-df2[rowSums(df2 == 'w')!=0,  ,drop = FALSE]

r2evans · Accepted Answer

For part 1 of your multi-part question, here's a solution:

myfunc <- function(s, len = 4, what = "w") {
  r <- rle(s)
  rlen <- length(r$values)
  for (ind in rev(seq_len(rlen))) {
    if (r$values[ind] != what) next
    if (r$lengths[ind] < len) next
    r$values <- c(
      if (ind > 1) r$values[1:(ind - 1L)],
      r$values[ind], "", r$values[ind],
      if (ind < rlen) r$values[(ind + 1L):rlen])
    r$lengths <- c(
      if (ind > 1) r$lengths[1:(ind - 1L)],
      1L, r$lengths[ind] - 2L, 1L,
      if (ind < rlen) r$lengths[(ind + 1L):rlen])
  }
  rlen <- length(r$values)
  if (r$lengths[1] == 1L && r$values[1] == what &&
        rlen > 1 && r$values[2] == "") {
    r$values[1] <- ""
  }
  if (r$lengths[rlen] == 1 && r$values[rlen] == what &&
        rlen > 1 && r$values[rlen-1] == "") {
    r$values[rlen] <- ""
  }
  inverse.rle(r)
}

Apply this to each row (sans id):

out <- cbind(df[, 1, drop = FALSE], t(apply(df[, -1], 1, myfunc)))
colnames(out)[-1] <- colnames(df)[-1]
out
#   id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
# 1  1  s  s  w        w  r  s  s   s
# 2  2  s  w  w  w  e  w  w  s  t   v
# 3  3              w  d  s  s  s   r
# 4  4  e  w                         
# 5  5  w  e  w        w  r  r  r   r
# 6  6  w  s  w  r  w  r  w  w  s   w
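The t() is needed because apply() over rows returns each row's result as a column of its output. A small illustration on a toy matrix (m is made up for this example and is not part of the data above):

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns
res <- apply(m, 1, rev)      # reverse each row

dim(res)
# [1] 3 2    one column per original row

dim(t(res))
# [1] 2 3    transposing restores the original orientation
```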

Compare with your expected output:

id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
 1  s  s  w        w  r  s  s   s
 3              w  d  s  s  s   r
 4  e  w
 5                 w  r  r  r   r

Difference:

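My output differs in two ways: rows 2 and 6, which contain no qualifying run, are returned unchanged rather than dropped; and row 5 keeps its leading w e w, since no rule in your conditions suggests it should be removed.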
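If rows without a qualifying run should be dropped, as in the expected output, one option is to test each row first. A minimal sketch: has_run is a hypothetical helper, not part of the function above, and df is rebuilt here only so the block runs on its own.

```r
# Sample data from the question, rebuilt so this block is self-contained
df <- data.frame(
  id  = 1:6,
  t1  = c("s", "s", "w", "e", "w", "w"),
  t2  = c("s", "w", "w", "w", "e", "s"),
  t3  = c("w", "w", "w", "w", "w", "w"),
  t4  = c("w", "w", "w", "w", "w", "r"),
  t5  = c("w", "e", "w", "w", "w", "w"),
  t6  = c("w", "w", "d", "w", "w", "r"),
  t7  = c("r", "w", "s", "w", "r", "w"),
  t8  = c("s", "s", "s", "w", "r", "w"),
  t9  = c("s", "t", "s", "w", "r", "s"),
  t10 = c("s", "v", "r", "w", "r", "w"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if s contains a run of `what` at least `len` long
has_run <- function(s, len = 4, what = "w") {
  r <- rle(s)
  any(r$values == what & r$lengths >= len)
}

keep <- apply(df[, -1], 1, has_run)
df$id[keep]
# [1] 1 3 4 5
```

Subsetting the result with out[keep, ] would then leave only the ids that appear in the expected output.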

Let's dive into one row of data to find out what's going on.

unlist(df[5,-1])
#  t1  t2  t3  t4  t5  t6  t7  t8  t9 t10 
# "w" "e" "w" "w" "w" "w" "r" "r" "r" "r" 

myfunc( unlist(df[5,-1]) )

  r <- rle(s)
r
# Run Length Encoding
#   lengths: Named int [1:4] 1 1 4 4
#   values : Named chr [1:4] "w" "e" "w" "r"

(I've removed R's attr attributes here for brevity.) This says that there are four distinct runs: first 1 "w", then 1 "e", then 4 "w", then 4 "r". We now iterate over the runs and, for each one that meets our constraints (its value matches what and its length is at least len), replace its values and lengths.
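Both constraints can also be checked on all runs at once, which is a quick way to see which indices the loop will act on. A small sketch, with row 5 typed in by hand and without names so the printed result stays plain:

```r
# Row 5 of the sample data
s <- c("w", "e", "w", "w", "w", "w", "r", "r", "r", "r")
r <- rle(s)

# Indices of the runs whose value is "w" and whose length is at least 4
hits <- which(r$values == "w" & r$lengths >= 4L)
hits
# [1] 3
```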

We need to do this backwards because each replacement turns one run into three, which shifts the index of every later run; starting from the end leaves the indices still to be visited unaffected. Ergo, rev(...) within the for loop.
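To illustrate the shift, here is a sketch on a made-up vector of run values (vals is hypothetical, standing in for a row with two qualifying runs of "w"):

```r
vals <- c("w", "s", "w")

# Splice at index 1 first (forward order): one element becomes three
fwd <- c(vals[1], "", vals[1], vals[2:3])
fwd
# [1] "w" ""  "w" "s" "w"
# The second run has moved from index 3 to index 5, so a loop that
# continues with ind = 3 would be looking at the wrong element.

# Splice at index 3 first (reverse order): earlier positions are untouched
bwd <- c(vals[1:2], vals[3], "", vals[3])
bwd
# [1] "w" "s" "w" ""  "w"
# Index 1 still refers to the first run.
```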

  rlen <- length(r$values)

ind <- 4 # look at the last value/length first
    if (r$values[ind] != what) next

ind <- 3
    if (r$values[ind] != what) next # nope, keep going
    if (r$lengths[ind] < len) next  # nope, keep going

At this point, since we passed both if statements, we now need to replace the $values and $lengths at [ind]:

the old value of r$values[3] ("w") needs to be replaced with c("w", "", "w") to indicate the first value, all middle values, and the last value;
the old length of r$lengths[3] (4 here) needs to be replaced with c(1, 4-2, 1), where the first 1 corresponds to the first "w" of the new values, the 4-2 corresponds to the new "", and the second 1 corresponds to the second "w".

The if (...) expressions inside the c(...) calls handle the edge cases ind == 1 and ind == rlen, where 1:(ind - 1L) and (ind + 1L):rlen would count backwards (for example 1:0) instead of selecting nothing.
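A short sketch of that edge case, using a made-up two-element vals and ind fixed at 1:

```r
ind  <- 1L
vals <- c("w", "r")

# 1:(ind - 1L) is 1:0 when ind is 1, which counts down instead of being empty
idx <- 1:(ind - 1L)
idx
# [1] 1 0

# Without the guard, vals[1:0] still picks up element 1, duplicating the "w"
unguarded <- c(vals[1:(ind - 1L)], vals[ind], "", vals[ind])
unguarded
# [1] "w" "w" ""  "w"

# An if () with no else yields NULL when its condition is FALSE,
# and c() silently drops NULL
guarded <- c(if (ind > 1) vals[1:(ind - 1L)], vals[ind], "", vals[ind])
guarded
# [1] "w" ""  "w"
```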

Okay, let's step out of the loop now that we've replaced the values. The current status of r is:

r
# Run Length Encoding
#   lengths: Named int [1:6] 1 1 1 2 1 4
#   values : Named chr [1:6] "w" "e" "w" "" "w" "r"

(attr removed.) Notice the changes: 1 w, 1 e, 1 w, 2 "", 1 w, 4 r.

Let's continue, in order to replace a first or last "w":

  rlen <- length(r$values) # because it may have changed
  if (r$lengths[1] == 1L && r$values[1] == what &&
        rlen > 1 && r$values[2] == "") {
    r$values[1] <- ""
  }

Normally, if we started with 4 or more "w", then t1 would be "w", t2 would be "", etc. In your expected output, you want this leading "w" removed, so the 4-part if clause is to see if this condition exists, and remove it accordingly. (The fact that r$values[1:2] are now the same does not matter.)

The same check is applied at the end of the vector.

In this case, it did not meet either condition, so r is unchanged from post-for-loop.
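For contrast, row 3 (five "w" at the start) does trigger the leading check. A sketch with the post-loop state of r for that row built by hand; the lengths and values are reconstructed from the row 3 data rather than copied from a session:

```r
# Post-loop state for row 3: w w w w w d s s s r
r <- structure(
  list(lengths = c(1L, 3L, 1L, 1L, 3L, 1L),
       values  = c("w", "", "w", "d", "s", "r")),
  class = "rle")
what <- "w"

rlen <- length(r$values)
# A lone leading "w" followed by blanks: blank it out as well
if (r$lengths[1] == 1L && r$values[1] == what &&
      rlen > 1 && r$values[2] == "") {
  r$values[1] <- ""
}

res <- inverse.rle(r)
res
#  [1] ""  ""  ""  ""  "w" "d" "s" "s" "s" "r"
```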

Lastly, we call inverse.rle, which takes the values and lengths and expands them back into the full vector.

  inverse.rle(r)
#  [1] "w" "e" "w" ""  ""  "w" "r" "r" "r" "r"

### equivalent to, apart from the names that rep() keeps
rep(r$values, times = r$lengths)
#  t1  t2  t6          t6 t10 t10 t10 t10 
# "w" "e" "w"  ""  "" "w" "r" "r" "r" "r"

(Actually, inverse.rle uses rep.int, which is a little faster than rep but with fewer features.)
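The visible difference between the two calls is the names: rep() replicates them, while the rep.int() used by inverse.rle returns a bare vector. A minimal sketch on a small hand-built rle object (the values here are illustrative only):

```r
r <- structure(
  list(lengths = c(1L, 2L, 1L),
       values  = c(t3 = "w", "", t6 = "w")),
  class = "rle")

a <- inverse.rle(r)                    # uses rep.int(), drops names
b <- rep(r$values, times = r$lengths)  # keeps (and repeats) names

is.null(names(a))
# [1] TRUE
names(b)
# [1] "t3" ""   ""   "t6"
identical(a, unname(b))
# [1] TRUE
```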
