Stelios K.

Reputation: 313

Selecting a sequence of random length starting and ending with specific values and limited by another column

I have a fairly large data set that has the form of the following table:

   value ID
1      0  A
2      0  A
3      1  A
4      1  A
5      0  A
6     -1  A
7      0  B
8      1  B
9      1  B
10     0  B
11     0  B
12     0  B
13     1  C
14     1  C
15     0  C
16     1  C
17     1  C
18     1  C
19     0  C

Essentially, I'd like to transform the above, keeping only the first and last rows of each sequence that starts with a zero, continues with an unknown number of ones, and ends at the last of those ones:

   value ID
2      0  A
4      1  A
7      0  B
9      1  B
15     0  C
18     1  C

Is there an easy way to accomplish this? dput of the first example follows:

structure(list(value = structure(c(2L, 2L, 3L, 3L, 2L, 1L, 2L, 
3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L), .Label = c("-1", 
"0", "1"), class = "factor"), ID = structure(c(1L, 1L, 1L, 1L, 
 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label =  c("A", "B", "C"), class = "factor")), .Names = c("value", "ID"), row.names = c(NA, -19L), class = "data.frame")
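
Note that value is stored as a factor in this dput, so calling as.integer() on it directly returns the level codes (1, 2, 3) rather than -1, 0, 1. A minimal sketch of the conversion, using a small hypothetical vector with the same levels instead of the full data:

```r
# Hypothetical stand-in for the `value` column (same levels as in the dput)
value <- factor(c("0", "0", "1", "1", "0", "-1"), levels = c("-1", "0", "1"))

# Level codes, NOT the original numbers
codes <- as.integer(value)
print(codes)
# [1] 2 2 3 3 2 1

# Go through character to recover the numbers the labels spell out
numbers <- as.integer(as.character(value))
print(numbers)
# [1]  0  0  1  1  0 -1
```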

Upvotes: 2

Views: 39

Answers (1)

David Arenburg

Reputation: 92282

Here's my attempt using a combination of the data.table and stringi packages:

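library(stringi)
library(data.table)
# Per ID: collapse the values into one string, locate the "01+" match,
# and map its start/end positions back to row numbers through .I
setDT(df)[, .(.I[stri_locate_all_regex(paste(value, collapse = ""), "01+")[[1]]], 0:1), by = ID]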
#    ID V1 V2
# 1:  A  2  0
# 2:  A  4  1
# 3:  B  7  0
# 4:  B  9  1
# 5:  C 15  0
# 6:  C 18  1

This converts each group's values to a single string, locates the start and end of the part that matches the regex 01+, and uses those positions to subset the row index .I. Finally, I'm just adding 0:1 as the value column (which seems redundant, to me at least).
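
As far as I can tell, the string approach relies on every value being exactly one character wide, so a -1 in the middle of a group would shift the positions, and more than one match per group would come back as all starts followed by all ends. Below is a rough base R sketch of the same idea using rle() instead of a regex. It is not the method above, just an alternative under assumptions: value is numeric (convert a factor first with as.integer(as.character(value))), and find_zero_one_runs is a hypothetical helper name.

```r
# Same data as in the question, with `value` as plain numbers
df <- data.frame(
  value = c(0, 0, 1, 1, 0, -1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0),
  ID    = rep(c("A", "B", "C"), times = c(6, 6, 7)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one group, return the row of the 0 that
# immediately precedes a run of 1s and the row of the last 1 in that run
find_zero_one_runs <- function(value, rows) {
  r    <- rle(value)
  ends <- cumsum(r$lengths)            # position of the last element of each run
  n    <- length(r$values)
  prev <- c(NA, r$values[-n])          # value of the run just before each run
  k    <- which(r$values == 1 & prev == 0)  # runs of 1s preceded by a run of 0s
  data.frame(
    row   = c(rbind(rows[ends[k - 1]], rows[ends[k]])),  # interleave start, end
    value = rep(0:1, length(k))
  )
}

# Apply the helper per ID, keeping the original row numbers
idx <- split(seq_len(nrow(df)), df$ID)
res <- do.call(rbind, lapply(names(idx), function(id) {
  out <- find_zero_one_runs(df$value[idx[[id]]], idx[[id]])
  data.frame(ID = rep(id, nrow(out)), out, stringsAsFactors = FALSE)
}))

print(res)
#   ID row value
# 1  A   2     0
# 2  A   4     1
# 3  B   7     0
# 4  B   9     1
# 5  C  15     0
# 6  C  18     1
```

Because rle() works on the values themselves rather than on their printed width, multi-character values such as -1 do not disturb the positions, and each matching sequence in a group yields its own start/end pair.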

Upvotes: 3
