Reputation: 313
I have a fairly large data set that has the form of the following table:
value ID
1 0 A
2 0 A
3 1 A
4 1 A
5 0 A
6 -1 A
7 0 B
8 1 B
9 1 B
10 0 B
11 0 B
12 0 B
13 1 C
14 1 C
15 0 C
16 1 C
17 1 C
18 1 C
19 0 C
Essentially, I'd like to transform the above, keeping only the first and last rows of each sequence that starts with a zero, is followed by an unknown number of ones, and ends at the last of those ones:
value ID
2 0 A
4 1 A
7 0 B
9 1 B
15 0 C
18 1 C
Is there an easy way to accomplish this? dput of the first example follows:
structure(list(value = structure(c(2L, 2L, 3L, 3L, 2L, 1L, 2L,
3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L), .Label = c("-1",
"0", "1"), class = "factor"), ID = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("value", "ID"), row.names = c(NA, -19L), class = "data.frame")
Upvotes: 2
Views: 39
Reputation: 92282
Here's my attempt using a combination of the data.table and stringi packages:
library(stringi)
library(data.table)
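# df is the data frame from the question's dput.
# Per ID: collapse value into one string, locate the "01+" match, and use its
# start/end positions to index the row numbers in .I; 0:1 labels start and end.
setDT(df)[, .(.I[stri_locate_all_regex(paste(value, collapse = ""), "01+")[[1]]], 0:1), by = ID]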
# ID V1 V2
# 1: A 2 0
# 2: A 4 1
# 3: B 7 0
# 4: B 9 1
# 5: C 15 0
# 6: C 18 1
This basically converts each group to a single string and then detects the beginning and the end of the part that matches the 01+ regex, using those positions to subset the row index .I. Finally, I'm just adding 0:1 to the result (which seems redundant, to me at least).
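The same string-position idea can be sketched in base R, which makes the intermediate steps visible. This is only an illustration under a few assumptions: find_runs is a hypothetical helper name, the data are rebuilt as plain numbers rather than the factor from the dput, and every value other than 0 or 1 is recoded to a single placeholder character. The recoding matters because a string position only equals a row offset when each row contributes exactly one character, and "-1" contributes two. The sketch also interleaves start and end rows, so a group with more than one run should still be labelled correctly.

```r
# Rebuild the example data from the question (values as plain numbers here).
df <- data.frame(
  value = c(0, 0, 1, 1, 0, -1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0),
  ID    = rep(c("A", "B", "C"), times = c(6, 6, 7))
)

# Hypothetical helper: row numbers of the first and last element of every
# "0 followed by one or more 1s" run within a single group.
find_runs <- function(value, rows) {
  # One character per row; anything that is not 0 or 1 (e.g. -1) becomes "x".
  v <- as.character(value)
  s <- paste(ifelse(v %in% c("0", "1"), v, "x"), collapse = "")
  m <- gregexpr("01+", s)[[1]]
  if (m[1] == -1) return(NULL)                 # no run in this group
  start <- as.integer(m)
  end   <- start + attr(m, "match.length") - 1L
  # Interleave as start1, end1, start2, end2, ... and label them 0, 1, 0, 1, ...
  data.frame(row   = rows[c(rbind(start, end))],
             value = rep(0:1, times = length(start)))
}

# Apply the helper to the row indices of each ID and stack the results.
res <- do.call(rbind, lapply(split(seq_len(nrow(df)), df$ID), function(i) {
  out <- find_runs(df$value[i], i)
  if (is.null(out)) NULL else cbind(ID = df$ID[i[1]], out)
}))
rownames(res) <- NULL
res
```

On the question's data this should pick rows 2 and 4 for A, 7 and 9 for B, and 15 and 18 for C, matching the data.table result above.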
Upvotes: 3