amyotun

Reputation: 39

How to extract all rows between start signal and end signal?

I have the following df and I would like to extract all rows based on the following start and end signals.

Start signal: when status changes from 1 to 0.

End signal: when status changes from 0 to -1.

df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))

   time status
1     1      0
2     2      1
3     3      1
4     4      0
5     5      0
6     6      0
7     7     -1
8     8      0
9     9      1
10   10      0
11   11      0
12   12      0
13   13     -1
14   14      0

Desired output:

   time status    
4     4      0
5     5      0
6     6      0
10   10      0
11   11      0
12   12      0

Upvotes: 4

Views: 193

Answers (4)

smci

Reputation: 33940

We flag the start and end markers, then use the cumulative sum of (start - end) to filter rows. The threshold in (cumsum(start)-cumsum(end)>1) is a slight fiddle: row 2 is a start marker with no matching end, so the running count is one higher than it should be from then on. With >0 instead of >1, rows that follow an end marker (such as row 14) would be wrongly included.

library(dplyr)

df %>% mutate(start=(status==1), end=(status==-1)) %>%
       filter(!start & !end & (cumsum(start)-cumsum(end)>1) ) %>%
       select(-start, -end)

#   time status
# 1    4      0
# 2    5      0
# 3    6      0
# 4   10      0
# 5   11      0
# 6   12      0
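
To see why the threshold is 1 rather than 0, it can help to look at the running count directly. The following is only a base R sketch of the same idea on the question's sample data; the names balance and res_cumsum are mine, not part of the answer above.

```r
# Sample data from the question
df <- data.frame(time = 1:14,
                 status = c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0))

start <- df$status == 1    # start markers
end   <- df$status == -1   # end markers

# Running count of start markers minus end markers
balance <- cumsum(start) - cumsum(end)

# Inspect the intermediate columns side by side
print(cbind(df, start, end, balance))

# Base R equivalent of the filter() step above
res_cumsum <- df[!start & !end & balance > 1, ]
print(res_cumsum)
```

Note that the fixed threshold depends on this particular data having exactly one unmatched 1 at the beginning, so it would need adjusting for other inputs.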

Upvotes: 2

hrbrmstr

Reputation: 78792

Do you have some more data (or can you generate some more data with a known outcome) to see whether these generalize?

Two similar approaches:

library(stringr)

df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))

dfr <- rle(df$status)

# first approach 

find_seq_str <- function() {
  str_locate_all(paste(gsub("-1", "x", dfr$values), collapse=""), "10")[[1]][,2]
}

df[unlist(lapply(find_seq_str(), 
  function(n) {
    i <- sum(dfr$lengths[1:(n-1)])
    tail(i:(i+dfr$lengths[n]), -1)
  })),]


# second approach

find_seq_ts <- function() {
  which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}

df[unlist(lapply(find_seq_ts(), 
  function(n) {
    i <- sum(dfr$lengths[1:(n)])+1
    head(i:(i+dfr$lengths[n+1]), -1)
  })),]

Both approaches need a run length encoding of the status vector.

The first replaces -1 with a single character so that the run values collapse into an unambiguous, contiguous string. str_locate_all then finds the "10" pairs that mark where a target sequence starts, and the ranges of zeroes are rebuilt from the rle lengths.

If it needs to be base R I can try to whip up something with regexpr.

The second uses embed to build a matrix of consecutive run-value pairs and checks each row for the same target sequence.
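
For what it's worth, here is a rough base R sketch of the string-matching idea using gregexpr instead of stringr. It is not the code from this answer: the names runs, codes, zero_runs and res_rle are made up for illustration, it assumes status only ever takes the values -1, 0 and 1, and it matches "10x" rather than "10" so that a run of zeroes only counts when a -1 follows it.

```r
df <- data.frame(time = 1:14,
                 status = c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0))

runs <- rle(df$status)

# One character per run: 1 -> "1", 0 -> "0", -1 -> "x"
codes <- paste(ifelse(runs$values == -1, "x", as.character(runs$values)),
               collapse = "")

# Start positions of "1 run, 0 run, -1 run" triples; gregexpr returns -1 if none
hits <- gregexpr("10x", codes, fixed = TRUE)[[1]]
zero_runs <- as.integer(hits[hits > 0]) + 1L   # index of the 0 run in each triple

# First and last row covered by each run
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# unlist(lapply()) copes with runs of different lengths
rows <- unlist(lapply(zero_runs, function(k) seq(run_start[k], run_end[k])))
res_rle <- df[rows, ]
print(res_rle)
```

Requiring the trailing "x" is a stricter rule than the one in the two approaches above, so the results can differ on data where a 1-to-0 transition is never closed by a -1.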

Caveats:

  • I did no benchmarking
  • Both create potentially big things if status is big.
  • I'm not completely positive it generalizes (hence my initial question).
  • David's is far more readable, maintainable & transferable code but you get to deal with all the "goodness" that comes with using data.table ;-)

I wrapped the approaches in functions as they could potentially then be parameterized, but you could just as easily just assign the value to a variable or shove it into the sapply (ugh, tho).

Upvotes: 1

David Arenburg

Reputation: 92282

Here's a possible solution using the data.table package. I first build a grouping index that increments at every status == 1 row, then check per group whether it also contains a status == -1. If it does, I subset the group from its second row up to the row just before the -1.

library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
#    indx time status
# 1:    2    4      0
# 2:    2    5      0
# 3:    2    6      0
# 4:    3   10      0
# 5:    3   11      0
# 6:    3   12      0 
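
If data.table is not an option, the same grouping idea can be sketched in base R with split and lapply. This is only an illustration, not a drop-in equivalent: grp, pieces and res_split are names I made up, and the two guards (the group must begin with a 1, and the -1 must sit at position 3 or later) are my own additions for edge cases the sample data does not exercise.

```r
df <- data.frame(time = 1:14,
                 status = c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0))

# A new group begins at every status == 1 row
grp <- cumsum(df$status == 1)

pieces <- lapply(split(df, grp), function(g) {
  stop_at <- which(g$status == -1)
  # Skip groups that do not open with a 1 or contain no -1
  if (g$status[1] != 1 || length(stop_at) == 0) return(NULL)
  # Skip groups with nothing between the 1 and the first -1
  if (stop_at[1] < 3) return(NULL)
  g[2:(stop_at[1] - 1), ]
})

# Drop the empty groups before binding the rest together
res_split <- do.call(rbind, Filter(Negate(is.null), pieces))
print(res_split)
```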

Upvotes: 6

devmacrile

Reputation: 460

A little ugly, but you can always just loop over the values and keep a flag for determining whether the element should be kept or not.

keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for(i in seq_len(nrow(df) - 1)) {  # seq_len() is safe even for a one-row df
    if(df$status[i] == 1 && df$status[i+1] == 0) { 
        flag <- TRUE
        next  # keep signal index false
    }
    if(df$status[i] == -1 && df$status[i+1] == 0) {
        flag <- FALSE
        next  # keep signal index false
    }
    keepers[i] <- flag
}
keepers[nrow(df)] <- flag  # Set the last element to final flag value
newdf <- df[keepers, ]  # subset based on the T/F values determined
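
The same flag idea can also be wrapped in a small function that follows the question's wording literally (start on a 1 to 0 change, end on a 0 to -1 change). This is just a sketch under that reading; extract_between is a hypothetical helper name, not something from the loop above.

```r
# Returns a logical vector marking the rows between a 1 -> 0 change
# and the next 0 -> -1 change
extract_between <- function(status) {
  n <- length(status)
  keep <- logical(n)
  inside <- FALSE
  for (i in seq_len(n)) {
    if (i > 1) {
      prev <- status[i - 1]
      if (prev == 1 && status[i] == 0)  inside <- TRUE   # start signal
      if (prev == 0 && status[i] == -1) inside <- FALSE  # end signal
    }
    keep[i] <- inside
  }
  keep
}

df <- data.frame(time = 1:14,
                 status = c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0))
res_flag <- df[extract_between(df$status), ]
print(res_flag)
```

Like the loop above, a window that starts but never reaches a -1 is kept through to the last row, which may or may not be what you want.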

Upvotes: 1
