Reputation: 23
I am trying to subset a data frame based on specific sequence occurring in column v3
.
A sample of a dataframe:
v1 <- c(1:20)
v2 <- c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0)
v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
my_df <- data.frame(v1,v2,v3) # creating a dataframe
sample output for my_df
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
4 4 0 3
5 5 1 2
6 6 0 3
7 7 1 2
8 8 1 4
9 9 1 4
10 10 0 2
11 11 1 3
12 12 1 2
13 13 0 3
14 14 0 3
15 15 0 3
16 16 1 4
17 17 1 4
18 18 0 2
19 19 0 3
20 20 0 3
The output I am trying to achieve should look like this
1 1 1 4
2 2 1 4
3 3 0 2
8 8 1 4
9 9 1 4
10 10 0 2
16 16 1 4
17 17 1 4
18 18 0 2
So I want to subset my df according to sequence of 4 4 2
in column v3
. What I tried so far is:
my_df[which(c(diff(v3))==-2),]
but this only extracts the middle four of the sequence 4 4 2
like
v1 v2 v3
2 2 1 4
9 9 1 4
17 17 1 4
Another option I tried:
m = match(v3, c(4,4,2))
> m
[1] 1 1 3 NA 3 NA 3 1 1 3 NA 3 NA NA NA 1 1 3 NA NA
> my_df[!is.na(m),]
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
5 5 1 2
7 7 1 2
8 8 1 4
9 9 1 4
10 10 0 2
12 12 1 2
16 16 1 4
17 17 1 4
18 18 0 2
This output gives me all 4 and 2 but not the sequence 4 4 2
that I want. Any help would be appreciated.
I already achieved this in matlab with for and if loop but I am just wondering how I can solve this in R in a loopless way.
Upvotes: 2
Views: 1531
Reputation: 38510
As long as v3 does not have any missing values and the values of v3 are single characters, you can also use gregexpr
to accomplish this as follows
# get the row indices where the pattern 442 starts c(1 , 8, 16)
rowstarts <- unlist(gregexpr("442", paste(my_df$v3, collapse="")))
# extract rows from the data fram
dfNew <- my_df[sort(c(outer(rowstarts, (0:2), "+"))), ]
which returns
dfNew
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
8 8 1 4
9 9 1 4
10 10 0 2
16 16 1 4
17 17 1 4
18 18 0 2
paste
with the collapse argument turns the vector v3 into a single character string. grexpr
then finds the starting position in this string for any "442" subexpression.
The final step subsets the data.frame using the outer
function suggested by @alexis-laz's in the comments above.
Upvotes: 2
Reputation: 887213
We can do this with data.table
. Convert the 'data.frame' to 'data.table' (setDT(my_df)
). Using shift
from data.table
, we get the next element with type = "lead"
. As shift
takes a vector of n
, we specify n = 0:2
, so that we get three columns with the n = 0
corresponds to the original 'v3' column and others 1st and 2nd next values. Then, paste
the elements rowwise (do.call(paste0, ...
), check whether it is equal to 442, get the index of TRUE values (which
), use rep
to replicate the index and add with 0:2 so that we get the index of the three rows for each index. This can be used to subset the original dataset rows.
library(data.table)
setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2,
type= "lead")) == 442), each = 3) + 0:2]]
# v1 v2 v3
#1: 1 1 4
#2: 2 1 4
#3: 3 0 2
#4: 8 1 4
#5: 9 1 4
#6: 10 0 2
#7: 16 1 4
#8: 17 1 4
#9: 18 0 2
my_df <- structure(list(v1 = 1:20, v2 = c(1L, 1L, 0L, 0L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), v3 = c(4L,
4L, 2L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 2L, 3L, 3L, 3L, 4L, 4L,
2L, 3L, 3L)), .Names = c("v1", "v2", "v3"), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))
Upvotes: 2