zumbay

Reputation: 17

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is that the rows I need lie before and after certain values in a column.

# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)

In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:

ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
  jj <- (i-2):(i+4)
  selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1

This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:

twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
  jj <- (i-2):(i+4)
  selection <- c(selection,jj)
}
df$selection[selection] <- 2

The ideal would be a function along the lines of this imaginary selector(data=df$y, values=c(1,2), before=2, after=5, afterafter=FALSE, beforebefore=FALSE), where values takes the critical values, before the number of rows to select before each of them, and after the number of rows to select after.

afterafter would additionally allow selecting a range that starts some rows after the value and ends further on, e.g. after=5, afterafter=10 for rows 5 through 10 after the value (beforebefore would do the same in the other direction).

Any tips and suggestions are very welcome! Thanks!

Upvotes: 1

Views: 1084

Answers (1)

lmo

Reputation: 38500

This is easy enough with rep and its each argument.

df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2

Here, rep repeats each row index that matches your criterion 7 times (two rows before, the row itself, and four after; the L suffix marks 7 as an integer). Adding the offsets -2 through 4, which are recycled across the repeated indices, gives the rows to select. Finally, assign the replacement value to those rows.
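
To make the index arithmetic concrete, here is a small sketch on a toy vector (the names hits and idx are only illustrative, not from the question). It also shows one possible way to drop indices that fall outside the data, which can happen when a match sits close to the first or last row.

```r
# Toy vector with matches at positions 5 and 12
y <- rep(0, 20)
y[c(5, 12)] <- 2

hits <- which(y == 2)                  # 5 12
# Each hit is repeated 7 times; the offsets -2:4 are recycled once per hit
idx <- rep(hits, each = 7L) + -2:4
idx
# 3 4 5 6 7 8 9 10 11 12 13 14 15 16

# Assumption: windows running past the ends should simply be truncated
idx <- idx[idx >= 1 & idx <= length(y)]
```

Without the last line, an index of 0 or below, or one beyond the number of rows, could lead to errors or unexpected results in the assignment.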

Note that for some comparisons, == will not be adequate because of floating-point precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like

which(abs(df$y - 2) < 0.001)

or whatever precision measure will work for your problem.
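
Putting these pieces together, something close to the function asked for in the question could look like the sketch below. The name select_around and its tol argument are hypothetical, not an existing R function, and the sketch assumes that later values may overwrite earlier ones where windows overlap.

```r
# Hypothetical helper: marks the rows in a window around each occurrence
# of the given values. Returns a vector with 0 for unselected rows.
select_around <- function(y, values, before = 2L, after = 4L, tol = 1e-8) {
  out <- numeric(length(y))
  offsets <- (-before):after
  for (v in values) {
    hits <- which(abs(y - v) < tol)          # tolerant comparison
    idx <- rep(hits, each = length(offsets)) + offsets
    idx <- idx[idx >= 1 & idx <= length(y)]  # drop out-of-range rows
    out[idx] <- v
  }
  out
}

# Same layout as the example data in the question
y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
selection <- select_around(y, values = c(1, 2))
which(selection == 1)   # rows 11-17, 42-48 and 78-84
```

The loop runs only over the handful of critical values, not over the rows, so it should stay cheap on a long series. A range such as "rows 5 to 10 after the value" would presumably just mean using offsets 5:10 instead.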

Upvotes: 1
