jerome
jerome

Reputation: 91

Select rows from a dataframe based on a pattern formed by consecutive values in a column

I'm average at R, and I would like some help with the following operation.

Let's say I have the following dataframe:

    >df
    ID   Label
    P1   M
    P1   S
    P2   M
    P2   M
    P2   S
    P3   M
    P3   S
    P3   M
    P4   S
    P4   M
    P5   M
    P5   M
    P5   S

I want to be able to select rows that occur in a particular sequence of the variable Label with respect to each ID.

For a pattern "MS", the expected output would be

    ID   Label
    P1   M
    P1   S
    P2   M
    P2   S
    P3   M
    P3   S

and for a pattern "MMS", the expected output would be

    ID   Label
    P2   M
    P2   M
    P2   S
    P5   M
    P5   M
    P5   S

and for pattern "SM", the expected output is:

    ID   Label
    P3   S
    P3   M
    P4   S
    P4   M

Please consider the fact that the data I am working on has a lot of rows and the solution I need to build needs to work for patterns of arbitrary lengths. (ex. "MSS", "SM", "MMSSMS", etc). I humbly request for your assistance.

Edit: I have updated this question (the example dataframe and the output example of pattern "MMS". I want to add that I want the pattern matching to happen after grouping the data using ID variable so that patterns can be found in the groups of data grouped by ID. Sorry for not being clear the first time.

Final Edit: Answers from @akrun, @boski, and @tmfmnk are working for me. @boski's and @akrun's solutions were faster in execution time (~2-10 seconds on 400k rows of data) compared to @tmfmnk's solution (~29 seconds on 400k rows of data). I advise the reader to refer all three of these solutions.

Upvotes: 5

Views: 1034

Answers (3)

tmfmnk
tmfmnk

Reputation: 39858

One base solution for the original question could be:

nchar <- nchar("MS")
x <- grepRaw("MS", paste(df$Label, collapse = ""), all = TRUE)
y <- rep(x, each = nchar) + 0:(nchar - 1)

df[1:nrow(df) %in% y, ]

  ID Label
1 P1     M
2 P1     S
4 P2     M
5 P2     S
6 P3     M
7 P3     S

nchar <- nchar("SM")
x <- grepRaw("SM", paste(df$Label, collapse = ""), all = TRUE)
y <- rep(x, each = nchar) + 0:(nchar - 1)

df[1:nrow(df) %in% y, ]

   ID Label
2  P1     S
3  P2     M
5  P2     S
6  P3     M
9  P4     S
10 P4     M

Or written in a dplyr form:

nchar <- nchar("MS")
df %>%
 filter(row_number() %in% c(rep(grepRaw("MS", paste(Label, collapse = ""), all = TRUE), 
            each = nchar) + 0:(nchar - 1)))

   ID Label
1  P1     M
2  P1     S
3  P2     M
4  P2     S
5  P3     M
6  P3     S
7  P3     M
8  P4     S
9  P5     M
10 P5     S

Also addressing the edit of the question:

nchar <- nchar("MS")
df %>%
 group_by(ID) %>%
 filter(row_number() %in% c(rep(grepRaw("MS", paste(Label, collapse = ""), all = TRUE), 
            each = nchar) + 0:(nchar - 1)))

  ID    Label
  <fct> <fct>
1 P1    M    
2 P1    S    
3 P2    M    
4 P2    S    
5 P3    M    
6 P3    S    
7 P5    M    
8 P5    S  

Upvotes: 1

akrun
akrun

Reputation: 887088

One option would be to compare the lead values and get the index grouped by 'ID'

library(data.table)
i1 <- unique(setDT(df)[, lapply(which(Reduce(`&`, 
  Map(`==`, shift(Label, n = 0:2, type = "lead"), c("M", "M", "S")))), 
       function(i) .I[i:(i+2)]) , by = ID]$V1)
df[i1]
#    ID Label
#1: P2     M
#2: P2     M
#3: P2     S
#4: P5     M
#5: P5     M
#6: P5     S

data

df <- structure(list(ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3", 
"P3", "P4", "P4", "P5", "P5", "P5"), Label = c("M", "S", "M", 
"M", "S", "M", "S", "M", "S", "M", "M", "M", "S")), 
class = "data.frame", row.names = c(NA, -13L))

Upvotes: 2

boski
boski

Reputation: 2467

You can try using gregexpr(). First paste all labels and find the start position of the pattern you are looking for.

> df
   ID Label
1  P1     M
2  P1     S
3  P2     M
4  P2     M
5  P2     S
6  P3     M
7  P3     S
8  P3     S
9  P4     S
10 P4     M
11 P5     M
12 P5     M
13 P5     S

Edit

My previous solution did not retrieve the whole pattern (just the start).

pattern="SM"
starts=gregexpr(pattern=pattern,paste(df$Label,collapse=""))[[1]]
positions=as.vector(sapply(starts,function(x){ 
  s=seq(x,x+nchar(pattern)-1)
  if (all(df$ID[s]==df$ID[x])){
    return(s)
  } else {return(rep(NA,nchar(pattern)))}
  }))
positions=positions[which(!is.na(positions))]

df[positions,]
df[positions,]
   ID Label
1  P1     M
2  P1     S
4  P2     M
5  P2     S
6  P3     M
7  P3     S
12 P5     M
13 P5     S

pattern="MMS"
   ID Label
3  P2     M
4  P2     M
5  P2     S
11 P5     M
12 P5     M
13 P5     S

pattern="SM"
   ID Label
9  P4     S
10 P4     M

Upvotes: 1

Related Questions