Brynn O'donnell

Reputation: 9

Subset data based on sequence of characters across rows

How can I subset df by a pattern of values in consecutive rows? In the example below, I'd like to keep the rows whose history values are "TRUE", "FALSE", "TRUE" consecutively. The data below is a bit odd, but you get the idea!

value <- c(1/1/16,1/2/16, 1/3/16, 1/4/16, 1/5/16, 1/6/16, 1/7/16, 1/8/16, 1/9/16, 1/10/16)

history <- c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")

df <- data.frame(value, history)
df

         value history  
1  0.062500000    TRUE  
2  0.031250000   FALSE  
3  0.020833333    TRUE  
4  0.015625000    TRUE  
5  0.012500000   FALSE  
6  0.010416667    TRUE  
7  0.008928571    TRUE  
8  0.007812500    TRUE  
9  0.006944444   FALSE  
10 0.006250000    TRUE  

I've tried grepl, but that matches within a single character string, not a sequence of values across consecutive rows.

The output would be the same df as above, but without row 7, as that doesn't follow the pattern mentioned.

Upvotes: 1

Views: 57

Answers (3)

G. Grothendieck

Reputation: 270248

The data in the question looks very strange so we used the data in the Note at the end. If you really have a character vector or factor with value "TRUE" and "FALSE" it can readily be translated to logicals using:

df <- transform(df, history = history == "TRUE")

1) rollapply First define the pattern and then search for it using a moving window with rollapplyr. That gives a logical vector which is TRUE if it is the end of such a pattern match. Find the indexes of the TRUEs and include the prior two indexes as well. Finally perform the subset.

library(zoo)

pattern <- c(TRUE, FALSE, TRUE)
ix <- which(rollapplyr(df$history, length(pattern), identical, pattern, fill = FALSE))
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "-"))))
df[ix, ]

giving:

         value history
1  0.062500000    TRUE
2  0.031250000   FALSE
3  0.020833333    TRUE
4  0.015625000    TRUE
5  0.012500000   FALSE
6  0.010416667    TRUE
8  0.007812500    TRUE
9  0.006944444   FALSE
10 0.006250000    TRUE

1a) magrittr The code in (1) could also be expressed as a magrittr pipeline, reusing pattern as defined above. (Solution (2) could be expressed using magrittr following similar ideas.)

library(magrittr)
library(zoo)

df %>%
  extract(
   extract(.,, "history") %>%
   rollapplyr(length(pattern), identical, pattern, fill = FALSE) %>%
   which %>%
   outer(seq_along(pattern) - 1L, "-") %>%
   sort %>%
   unique, )

2) gregexpr Using pattern defined above we convert it to a character string of 0s and 1s and also convert df$history to such a string. We can then use gregexpr to find the indexes of the first element of each match and then expand that to all indexes and subset. We get the same answer as before. This alternative uses no packages.

collapse <- function(x) paste0(x + 0, collapse = "")
ix <- gregexpr(collapse(pattern), collapse(df$history))[[1]]
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "+"))))
df[ix, ]
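
One caveat with (2): gregexpr reports non-overlapping matches, so if two occurrences of the pattern share a row (for example TRUE, FALSE, TRUE, FALSE, TRUE) the second occurrence is not found. The sketch below is not part of the solution above; it is one possible workaround that wraps the pattern in a zero-width lookahead so that overlapping matches are reported. df2 is hypothetical toy data made up for the illustration.

```r
# Hypothetical toy data in which two occurrences of the pattern overlap
df2 <- data.frame(value = 1:6,
                  history = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE))
pattern <- c(TRUE, FALSE, TRUE)

collapse <- function(x) paste0(x + 0, collapse = "")

# A zero-width lookahead lets matches overlap; (?=...) needs perl = TRUE
rx <- paste0("(?=", collapse(pattern), ")")
starts <- gregexpr(rx, collapse(df2$history), perl = TRUE)[[1]]
starts <- starts[starts > 0]  # gregexpr returns -1 when nothing matches

# Expand each start position to all rows covered by the match
ix <- unique(sort(c(outer(starts, seq_along(pattern) - 1L, "+"))))
df2[ix, ]
```

Here the matches start at rows 1 and 3, so rows 1 to 5 are kept and row 6 is dropped. Dropping the -1 also avoids a subscript error in df2[ix, ] when the pattern does not occur at all.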

Note

Lines <- "
         value history  
1  0.062500000    TRUE  
2  0.031250000   FALSE  
3  0.020833333    TRUE  
4  0.015625000    TRUE  
5  0.012500000   FALSE  
6  0.010416667    TRUE  
7  0.008928571    TRUE  
8  0.007812500    TRUE  
9  0.006944444   FALSE  
10 0.006250000    TRUE"
df <- read.table(text = Lines)

Upvotes: 1

Nar

Reputation: 658

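An option using lag and lead from dplyr (base R's stats::lag does not shift a plain vector):

    library(dplyr)

    df <- data.frame(value, history)

    # n[i] is TRUE when the pattern ends at row i
    n <- grepl("TRUE, FALSE, TRUE", paste(lag(history, 2), lag(history), history, sep = ", "))

    # keep a row if a match ends at it or at one of the next two rows
    cond <- n | lead(n, default = FALSE) | lead(n, 2, default = FALSE)
    df[cond, ]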

Upvotes: 0

Frank

Reputation: 66819

You could do...

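s = c("TRUE", "FALSE", "TRUE")

library(data.table)
# embed() returns each window in reverse order; that is harmless here because s
# reads the same in both directions, otherwise look up as.list(rev(s)) instead
w = as.data.table(embed(history, length(s)))[as.list(s), on=paste0("V", seq_along(s)), which=TRUE]

df$v <- FALSE
# each = length(w) so that every match start is paired with every offset
df$v[w + rep(seq_along(s)-1L, each=length(w))] <- TRUE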

         value history     v
1  0.062500000    TRUE  TRUE
2  0.031250000   FALSE  TRUE
3  0.020833333    TRUE  TRUE
4  0.015625000    TRUE  TRUE
5  0.012500000   FALSE  TRUE
6  0.010416667    TRUE  TRUE
7  0.008928571    TRUE FALSE
8  0.007812500    TRUE  TRUE
9  0.006944444   FALSE  TRUE
10 0.006250000    TRUE  TRUE

You can then filter with subset(df, v).


This works using a data.table join, x[i, which=TRUE], which looks up i = as.list(s) in x = embed(history, length(s)) and reports which rows of x match:

> as.data.table(as.list(s))
     V1    V2   V3
1: TRUE FALSE TRUE

> as.data.table(embed(history, length(s)))
      V1    V2    V3
1:  TRUE FALSE  TRUE
2:  TRUE  TRUE FALSE
3: FALSE  TRUE  TRUE
4:  TRUE FALSE  TRUE
5:  TRUE  TRUE FALSE
6:  TRUE  TRUE  TRUE
7: FALSE  TRUE  TRUE
8:  TRUE FALSE  TRUE

The w + rep(...) does the same job as the outer(...) in @GGrothendieck's solution (1), except that here w contains the position of the start of each match, not the end.
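
For comparison, the same window lookup can be sketched in base R without data.table. This is only an illustration, assuming history is the character vector from the question; windows and keep are names introduced here. Because embed() returns each window with the latest element first, the columns are reversed before comparing, which matters when the pattern is not a palindrome.

```r
history <- c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE",
             "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
s <- c("TRUE", "FALSE", "TRUE")

# One window per row; reverse the columns so each row reads in row order
windows <- embed(history, length(s))[, length(s):1, drop = FALSE]

# Start positions of the windows that equal the pattern
w <- which(apply(windows, 1, function(r) all(r == s)))

# Flag every row covered by a match
keep <- rep(FALSE, length(history))
keep[c(outer(w, seq_along(s) - 1L, "+"))] <- TRUE
which(!keep)
```

With the data from the question, w holds the start positions 1, 4 and 8, and row 7 is the only row left unflagged.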

Upvotes: 1
