This relates to my previous question on identifying the occurrence of a value in a data frame per id. This time I am trying to identify consecutive measurements per id with a length of 4 or more.
Ex.
Below is an example of w occurring consecutively 4 times:
id t1 t2 t3 t4 t5 t6
1 s s w w w w
For the same id, here is the same run of 4 consecutive w values, followed by 4 non-w occurrences after the last w:
id t3 t4 t5 t6 t7 t8 t9 t10
1 w w w w r s s s
I would like to save this in a df as:
id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
 1  s  s  w        w  r  s  s   s
The format of my dataset with and without consecutive w occurrences:
id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
1 s s w w w w r s s s #after t2 value there are 4 occ. of w and after t6 (last one) there are 4 non-w occ.
2 s w w w e w w s t v #no 4 consecutive w occurrence and no 4 non-w occurrence after t7
3 w w w w w d s s s r #5 occ. of w after t5
4 e w w w w w w w w w #9 occ. of w after t1
5 w e w w w w r r r r #4 occ. of w after t2 and 4 occ. of non-w after t6
6 w s w r w r w w s w #no 4 consecutive w occurrences
Output:
id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
 1  s  s  w        w  r  s  s   s
 3              w  d  s  s  s   r
 4  e  w
 5                 w  r  r  r   r
How can I split this format into 2 df's, one with the start values and one with the end values?
Ex.
df1:
id t1 t2 t3
1 s s w
as well as the other start cases for id2, id3...
df2:
id t6 t7 t8 t9 t10
1 w r s s s
as well as the other end cases for id2, id3...
Sample data:
df<-structure(list(id=c(1,2,3,4,5,6), t1=c("s","s","w","e","w","w"), t2=c("s","w","w","w","e","s"),t3 = c("w","w","w","w","w","w"),
t4 = c("w","w","w","w","w","r"), t5 = c("w","e","w","w","w","w"), t6 = c("w","w","d","w","w","r"),
t7= c("r","w","s","w","r","w"), t8 = c("s","s","s","w","r","w"), t9=c("s","t","s","w","r","s"), t10=c("s","v","r","w","r","w")), row.names = c(NA, 6L), class = "data.frame")
Code that identifies the start and end points based on w:
Start
(Not working for consecutive time steps):
df1 <- df
df1[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) > match('w', x), '')))
df1<-df1[rowSums(df1 == 'w')!=0, ,drop = FALSE]
End
(Not working for consecutive time steps):
df2 <- df
df2[-1] <- t(apply(df[-1], 1, function(x) replace(x, seq_along(x) <= match('w', x), '')))
df2 <- df2[c(TRUE, colSums(df2[-2] != '') > 0)]
df2<-df2[rowSums(df2 == 'w')!=0, ,drop = FALSE]
Reputation: 160447
For part 1 of your multi-part question, here's a solution:
myfunc <- function(s, len = 4, what = "w") {
  r <- rle(s)
  rlen <- length(r$values)
  # walk the runs from last to first so earlier indices do not shift
  for (ind in rev(seq_len(rlen))) {
    if (r$values[ind] != what) next
    if (r$lengths[ind] < len) next
    # split the run into: first value, blanked middle, last value
    r$values <- c(
      if (ind > 1) r$values[1:(ind - 1L)],
      r$values[ind], "", r$values[ind],
      if (ind < rlen) r$values[(ind + 1L):rlen])
    r$lengths <- c(
      if (ind > 1) r$lengths[1:(ind - 1L)],
      1L, r$lengths[ind] - 2L, 1L,
      if (ind < rlen) r$lengths[(ind + 1L):rlen])
  }
  rlen <- length(r$values)
  # blank a leading `what` that has nothing before it
  if (r$lengths[1] == 1L && r$values[1] == what &&
      rlen > 1 && r$values[2] == "") {
    r$values[1] <- ""
  }
  # blank a trailing `what` that has nothing after it
  if (r$lengths[rlen] == 1 && r$values[rlen] == what &&
      rlen > 1 && r$values[rlen-1] == "") {
    r$values[rlen] <- ""
  }
  inverse.rle(r)
}
Apply this to each row (sans id):
out <- cbind(df[, 1, drop = FALSE], t(apply(df[, -1], 1, myfunc)))
colnames(out)[-1] <- colnames(df)[-1]
out
#   id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
# 1  1  s  s  w        w  r  s  s   s
# 2  2  s  w  w  w  e  w  w  s  t   v
# 3  3              w  d  s  s  s   r
# 4  4  e  w
# 5  5  w  e  w        w  r  r  r   r
# 6  6  w  s  w  r  w  r  w  w  s   w
Compare with your expected output:
id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
 1  s  s  w        w  r  s  s   s
 3              w  d  s  s  s   r
 4  e  w
 5                 w  r  r  r   r
Difference: row 5 still starts with w e w, since there was no rule in your conditions that would suggest it needs to be reduced.
Let's dive into one row of data to find out what's going on.
unlist(df[5,-1])
# t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
# "w" "e" "w" "w" "w" "w" "r" "r" "r" "r"
myfunc( unlist(df[5,-1]) )
r <- rle(s)
r
# Run Length Encoding
# lengths: Named int [1:4] 1 1 4 4
# values : Named chr [1:4] "w" "e" "w" "r"
(I've removed R's attr attributes here for brevity.) This says that there are four distinct sequences. First comes 1 "w", then 1 "e", then 4 "w", then 4 "r". We're now going to iterate over each of them and, if they meet our constraints (value matches what, length is at least len), then we replace the values and lengths.
But we need to do it backwards, so that the next ind value (iterating over the indices of each vector) has not shifted. Ergo, rev(...) within the for loop.
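To make the encoding concrete, here is a minimal sketch that runs rle on row 5 typed in as a plain vector and picks out the runs that meet both constraints. The name hits is only for illustration and is not part of myfunc.

```r
# Row 5 of the sample data, as a plain (unnamed) character vector
s <- c("w", "e", "w", "w", "w", "w", "r", "r", "r", "r")

r <- rle(s)
r$values   # "w" "e" "w" "r"
r$lengths  # 1 1 4 4

# Runs whose value is "w" and whose length is at least 4
hits <- which(r$values == "w" & r$lengths >= 4)
hits       # 3

# Visiting qualifying runs from last to first means that splitting one run
# into three never moves a run that still has to be visited
rev(hits)
```

Only the third run qualifies here, so the loop body does real work exactly once for this row.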
rlen <- length(r$values)
ind <- 4 # look at the last value/length first
if (r$values[ind] != what) next
ind <- 3
if (r$values[ind] != what) next # nope, keep going
if (r$lengths[ind] < len) next # nope, keep going
At this point, since we passed both if statements, we now need to replace the $values and $lengths at [ind]:
r$values[3] ("w") needs to be replaced with c("w", "", "w") to indicate the first value, all middle values, and the last value;
r$lengths[3] (4 here) needs to be replaced with c(1, 4-2, 1), where the first 1 corresponds to the first "w" of the new values; the 4-2 corresponds to the new ""; and the second 1 is for the second "w" in the replaced vector.
The if (...) expressions within the new vectors are so that we deal correctly with ind==1 or ind==rlen, where [1-1] and [rlen+1] will not do what we want.
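The replacement step on its own looks like the sketch below, again using row 5 as a plain vector. new_values and new_lengths are illustrative names; myfunc assigns straight back into r instead.

```r
s <- c("w", "e", "w", "w", "w", "w", "r", "r", "r", "r")
r <- rle(s)
ind <- 3L
rlen <- length(r$values)

# An if () without else yields NULL when its condition is FALSE,
# and c() silently drops NULL, so the guards cost nothing
new_values <- c(
  if (ind > 1) r$values[1:(ind - 1L)],
  r$values[ind], "", r$values[ind],
  if (ind < rlen) r$values[(ind + 1L):rlen])
new_lengths <- c(
  if (ind > 1) r$lengths[1:(ind - 1L)],
  1L, r$lengths[ind] - 2L, 1L,
  if (ind < rlen) r$lengths[(ind + 1L):rlen])

new_values   # "w" "e" "w" ""  "w" "r"
new_lengths  # 1 1 1 2 1 4

# What the guards protect against: for ind == 1, 1:(ind - 1L) is 1:0,
# which counts down and selects element 1 instead of nothing
r$values[1:0]  # "w"
```

The lengths still sum to 10, so the decoded vector keeps the same number of time steps.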
Okay, let's step out of the loop, we've replaced the values. The current status of r is:
r
# Run Length Encoding
# lengths: Named int [1:6] 1 1 1 2 1 4
# values : Named chr [1:6] "w" "e" "w" "" "w" "r"
(attr removed.) Notice the changes: 1 w, 1 e, 1 w, 2 "", 1 w, 4 r.
Let's continue, in order to replace a first or last "w":
rlen <- length(r$values) # because it may have changed
if (r$lengths[1] == 1L && r$values[1] == what &&
    rlen > 1 && r$values[2] == "") {
  r$values[1] <- ""
}
Normally, if we started with 4 or more "w", then t1 would be "w", t2 would be "", etc. In your expected output, you want this leading "w" removed, so the 4-part if clause is to see if this condition exists, and remove it accordingly. (The fact that r$values[1:2] are now the same does not matter.)
Similar for the end of the vector.
In this case, it did not meet either condition, so r is unchanged from post-for-loop.
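For contrast, row 3 (w w w w w d s s s r) does meet the first condition. The sketch below starts from the state I would expect r to be in after the loop for that row, built by hand here rather than taken from a run of myfunc, and applies both edge checks.

```r
# Assumed post-loop state for row 3: the run of five "w" has been split
# into 1 "w", 3 "", 1 "w"
r <- structure(
  list(lengths = c(1L, 3L, 1L, 1L, 3L, 1L),
       values  = c("w", "", "w", "d", "s", "r")),
  class = "rle")
what <- "w"
rlen <- length(r$values)

# Leading "w" directly followed by blanks: blank it as well
if (r$lengths[1] == 1L && r$values[1] == what &&
    rlen > 1 && r$values[2] == "") {
  r$values[1] <- ""
}
# Trailing check: the last value is "r", so nothing happens here
if (r$lengths[rlen] == 1 && r$values[rlen] == what &&
    rlen > 1 && r$values[rlen - 1] == "") {
  r$values[rlen] <- ""
}

inverse.rle(r)
# [1] ""  ""  ""  ""  "w" "d" "s" "s" "s" "r"
```

That matches row 3 of the output above, where the only "w" left is the one at t5.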
Lastly, we call inverse.rle, which takes the values and lengths and converts them back into the vector.
inverse.rle(r)
# [1] "w" "e" "w" "" "" "w" "r" "r" "r" "r"
### perfectly equivalent to
rep(r$values, times = r$lengths)
#  t1  t2  t6          t6 t10 t10 t10 t10
# "w" "e" "w"  ""  "" "w" "r" "r" "r" "r"
(Actually, inverse.rle uses rep.int, which is a little faster than rep but with fewer features.)
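Your expected output also leaves out ids 2 and 6, which myfunc returns untouched. One possible way to drop them, sketched under the assumption that "no run of 4 or more w" is the criterion for leaving a row out, is a small rle-based test per row. has_run and keep are hypothetical names, and the data frame is rebuilt here only so the snippet runs on its own (id is integer here rather than numeric).

```r
# Sample data from the question, one string per row
rows <- c("s s w w w w r s s s",
          "s w w w e w w s t v",
          "w w w w w d s s s r",
          "e w w w w w w w w w",
          "w e w w w w r r r r",
          "w s w r w r w w s w")
m <- do.call(rbind, strsplit(rows, " "))
colnames(m) <- paste0("t", 1:10)
df <- data.frame(id = 1:6, m, stringsAsFactors = FALSE)

# TRUE when the row contains at least one run of `what` of length >= len
has_run <- function(s, len = 4, what = "w") {
  r <- rle(s)
  any(r$values == what & r$lengths >= len)
}

keep <- apply(df[, -1], 1, has_run)
df$id[keep]
# [1] 1 3 4 5
```

With out from above, out[keep, ] would then hold only ids 1, 3, 4 and 5. Row 5 would still carry its leading w e, for the reason given earlier.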