Amgh
Amgh

Reputation: 23

R - Subset dataframe based on a repeated sequence

I am trying to subset a data frame based on specific sequence occurring in column v3. A sample of a dataframe:

v1 <- c(1:20)
v2 <- c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0)
v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
my_df <- data.frame(v1,v2,v3)         # creating a dataframe

sample output for my_df

   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
4   4  0  3
5   5  1  2
6   6  0  3
7   7  1  2
8   8  1  4
9   9  1  4
10 10  0  2
11 11  1  3
12 12  1  2
13 13  0  3
14 14  0  3
15 15  0  3
16 16  1  4
17 17  1  4
18 18  0  2
19 19  0  3
20 20  0  3

The output I am trying to achieve should look like this

1   1  1  4
2   2  1  4
3   3  0  2
8   8  1  4
9   9  1  4
10 10  0  2
16 16  1  4
17 17  1  4
18 18  0  2

So I want to subset my df according to sequence of 4 4 2 in column v3. What I tried so far is:

my_df[which(c(diff(v3))==-2),]

but this only extracts the middle four of the sequence 4 4 2 like

v1 v2 v3
 2  2  1  4
 9  9  1  4
17 17  1  4

Another option I tried:

m = match(v3, c(4,4,2))
> m
 [1]  1  1  3 NA  3 NA  3  1  1  3 NA  3 NA NA NA  1  1  3 NA NA  
> my_df[!is.na(m),]
   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
5   5  1  2
7   7  1  2
8   8  1  4
9   9  1  4
10 10  0  2
12 12  1  2
16 16  1  4
17 17  1  4
18 18  0  2

This output gives me all 4 and 2 but not the sequence 4 4 2 that I want. Any help would be appreciated. I already achieved this in matlab with for and if loop but I am just wondering how I can solve this in R in a loopless way.

Upvotes: 2

Views: 1531

Answers (2)

lmo
lmo

Reputation: 38510

As long as v3 does not have any missing values and the values of v3 are single characters, you can also use gregexpr to accomplish this as follows

# get the row indices where the pattern 442 starts c(1 , 8, 16)
rowstarts <- unlist(gregexpr("442", paste(my_df$v3, collapse="")))

# extract rows from the data fram
dfNew <- my_df[sort(c(outer(rowstarts, (0:2), "+"))), ]

which returns

dfNew
   v1 v2 v3
1   1  1  4
2   2  1  4
3   3  0  2
8   8  1  4
9   9  1  4
10 10  0  2
16 16  1  4
17 17  1  4
18 18  0  2

paste with the collapse argument turns the vector v3 into a single character string. grexpr then finds the starting position in this string for any "442" subexpression.

The final step subsets the data.frame using the outer function suggested by @alexis-laz's in the comments above.

Upvotes: 2

akrun
akrun

Reputation: 887213

We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(my_df)). Using shift from data.table, we get the next element with type = "lead". As shift takes a vector of n, we specify n = 0:2, so that we get three columns with the n = 0 corresponds to the original 'v3' column and others 1st and 2nd next values. Then, paste the elements rowwise (do.call(paste0, ...), check whether it is equal to 442, get the index of TRUE values (which), use rep to replicate the index and add with 0:2 so that we get the index of the three rows for each index. This can be used to subset the original dataset rows.

library(data.table)
setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2,
                 type= "lead")) == 442), each = 3) + 0:2]]
#   v1 v2 v3
#1:  1  1  4
#2:  2  1  4
#3:  3  0  2
#4:  8  1  4
#5:  9  1  4
#6: 10  0  2
#7: 16  1  4
#8: 17  1  4
#9: 18  0  2

data

my_df <- structure(list(v1 = 1:20, v2 = c(1L, 1L, 0L, 0L, 1L, 0L, 1L, 
1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), v3 = c(4L, 
4L, 2L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 2L, 3L, 3L, 3L, 4L, 4L, 
2L, 3L, 3L)), .Names = c("v1", "v2", "v3"), class = "data.frame", 
row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20"))

Upvotes: 2

Related Questions