R - Subset dataframe based on a repeated sequence

Question

I am trying to subset a data frame based on specific sequence occurring in column v3. A sample of a dataframe:

v1 <- c(1:20)
v2 <- c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0)
v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
my_df <- data.frame(v1,v2,v3)         # creating a dataframe

sample output for my_df

   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
4   4  0  3
5   5  1  2
6   6  0  3
7   7  1  2
8   8  1  4
9   9  1  4
10 10  0  2
11 11  1  3
12 12  1  2
13 13  0  3
14 14  0  3
15 15  0  3
16 16  1  4
17 17  1  4
18 18  0  2
19 19  0  3
20 20  0  3

The output I am trying to achieve should look like this

1   1  1  4
2   2  1  4
3   3  0  2
8   8  1  4
9   9  1  4
10 10  0  2
16 16  1  4
17 17  1  4
18 18  0  2

So I want to subset my df according to sequence of 4 4 2 in column v3. What I tried so far is:

my_df[which(c(diff(v3))==-2),]

but this only extracts the middle four of the sequence 4 4 2 like

v1 v2 v3
 2  2  1  4
 9  9  1  4
17 17  1  4

Another option I tried:

m = match(v3, c(4,4,2))
> m
 [1]  1  1  3 NA  3 NA  3  1  1  3 NA  3 NA NA NA  1  1  3 NA NA  
> my_df[!is.na(m),]
   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
5   5  1  2
7   7  1  2
8   8  1  4
9   9  1  4
10 10  0  2
12 12  1  2
16 16  1  4
17 17  1  4
18 18  0  2

This output gives me all 4 and 2 but not the sequence 4 4 2 that I want. Any help would be appreciated. I already achieved this in matlab with for and if loop but I am just wondering how I can solve this in R in a loopless way.

akrun · Accepted Answer

We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(my_df)). Using shift from data.table, we get the next element with type = "lead". As shift takes a vector of n, we specify n = 0:2, so that we get three columns with the n = 0 corresponds to the original 'v3' column and others 1st and 2nd next values. Then, paste the elements rowwise (do.call(paste0, ...), check whether it is equal to 442, get the index of TRUE values (which), use rep to replicate the index and add with 0:2 so that we get the index of the three rows for each index. This can be used to subset the original dataset rows.

library(data.table)
setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2,
                 type= "lead")) == 442), each = 3) + 0:2]]
#   v1 v2 v3
#1:  1  1  4
#2:  2  1  4
#3:  3  0  2
#4:  8  1  4
#5:  9  1  4
#6: 10  0  2
#7: 16  1  4
#8: 17  1  4
#9: 18  0  2

data

my_df <- structure(list(v1 = 1:20, v2 = c(1L, 1L, 0L, 0L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), v3 = c(4L, 
4L, 2L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 2L, 3L, 3L, 3L, 4L, 4L, 
2L, 3L, 3L)), .Names = c("v1", "v2", "v3"), class = "data.frame", 
row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20"))

R - Subset dataframe based on a repeated sequence

Answers (2)

data

Related Questions