Olivia

Reputation: 814

R deleting rows based on condition (advanced)

I have some very large data frames. I created a conditional column that takes the value 10, 1, 60 or 70 depending on some specific values, with everything in between set to 0. I then built a small data frame holding every 10 and 70 position (the start and end of each sequence I wanted) and used it to subset my data frames so that I was left with only what lies between each 10 and 70 (including the 1, 60 and 0 rows between them).
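
The setup step presumably looked something like this (a minimal sketch; the value column and the four tests are illustrative assumptions, not the actual code used):

# sketch only: "value" and the four tests stand in for whatever the
# real data uses to define the 10 / 1 / 60 / 70 markers
df$condition = ifelse(df$value == "start", 10,
               ifelse(df$value == "event", 1,
               ifelse(df$value == "pre_end", 60,
               ifelse(df$value == "end", 70, 0))))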

The resulting data frame is almost the subset I require and the conditional column now looks as follows:

    condition
           10
            0
            0
            1
            0
            0
            0
            60
            0
            0
            70
            10
            0
            0
            1
            0
            60
            0
            0
            70
            0
            0
            10
            0
            0
            1
            0
            0
            60
            0
            0
            70
            0
            10
            0
            0
            1
            0
            0
            0
            0
            60
            0
            0
            0
            70

After each 70 I should get a 10 again, and each cycle repeats. However, when I set this process up to loop over multiple files overnight, the results included some unwanted data. As you can see above, there are rows between a 70 and the following 10. I'd like to eliminate all the 0 rows that sit between a 70 and the next 10. In this example there are only a few rows between the non-zero values, but my actual data has many more, so it's not trivial.

So the resulting data frame from the above would be:

condition
       10
        0
        0
        1
        0
        0
        0
        60
        0
        0
        70
        10
        0
        0
        1
        0
        60
        0
        0
        70
        10
        0
        0
        1
        0
        0
        60
        0
        0
        70
        10
        0
        0
        1
        0
        0
        0
        0
        60
        0
        0
        0
        70

These data frames range from 1 to 10 million rows, so I don't think a loop that visits each row in turn would be efficient. Any ideas?

UPDATE

I thought I had a solution using the same methodology as I'd already used to get here.

I tried:

library(data.table) # for shift()

test$t = NA
# flag the row immediately after a 70 when it isn't a 10 (start of a 'bad' stretch) ...
test$t = ifelse(shift(test$ff) == 70 & test$ff != 10, 1, test$t)
# ... and the row immediately before a 10 when it isn't a 70 (end of a 'bad' stretch)
test$t = ifelse(shift(test$ff, type = "lead") == 10 & test$ff != 70, 1, test$t)
test2 = data.frame(s = test$t[c(T,F)], e = test$t[c(F,T)])
Error in data.frame(s = test$t[c(T, F)], e = test$t[c(F, T)]) : 
  arguments imply differing number of rows: 44085, 44084

My idea was to find the row after each 'bad' 70 and the row before each 'bad' 10, then delete all the positions between them using test2. It failed because my data also has cases like:

70
0
0
60
0
0
70
10

Here there is no paired 'bad' 10 for the 70. Thinking about it, instead of trying to remove what's bad, I could subset again with my original method of only keeping sequences from 10 to 70, and that might solve it. The way I did that was as follows:

df2 = df[df$ff == 10 | df$ff == 70, ]
# pair alternating rows: odd rows are 10s (starts), even rows 70s (ends)
test = data.frame(s = df2$datetime[c(T,F)], e = df2$datetime[c(F,T)])
# look up the row index of each start and each end in the full data
for(i in 1:length(test$s)){test$ss[i] = which(df$datetime == test$s[i])}
for(i in 1:length(test$s)){test$ee[i] = which(df$datetime == test$e[i])}
# flag every row from each start through its end, then keep the flagged rows
df$t = NA
for(i in 1:length(test$s)){df$t[test$ss[i]:test$ee[i]] = 1}
df_all = df[which(df$t == 1), ]
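
On data this size, the two which() loops and the flagging loop can also be vectorized. A sketch of the same method (assuming each datetime value is unique, as the which() lookups above already require):

ss = match(test$s, df$datetime) # row index of each start
ee = match(test$e, df$datetime) # row index of each end
# expand every start:end pair into row indices and subset once
df_all = df[unlist(Map(seq, ss, ee)), ]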

Upvotes: 1

Views: 268

Answers (2)

SunLisa

Reputation: 134

Assuming your condition data.frame is called test:

start_end <- data.frame(a = which(test$condition == 10), b = which(test$condition == 70))

This gives a data.frame that marks each start and each end.

Next, write a function that generates the sequence of row indices from a start to an end:

fun.seq <- function(a, b) {
  return(seq(a, b, 1))
}

mapply that function over each row of start_end, unlist the result, and use it to select the rows:

test[unlist(mapply(fun.seq, start_end$a, start_end$b)), ]
 [1] 10  0  0  1  0  0  0 60  0  0 70 10  0  0  1  0 60  0  0 70 10  0  0  1  0  0 60  0  0 70 10  0  0  1  0  0  0  0 60  0  0  0 70
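
Since test has a single column, the [ subset drops the result to a vector, which is why the output above prints as one. To keep a data.frame, add drop = FALSE:

test[unlist(mapply(fun.seq, start_end$a, start_end$b)), , drop = FALSE]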

I am not sure whether it would be fast enough on millions of rows.

Maybe data.table could do this more quickly, but I'm not sure how; a possible sketch follows the mapply output below.

I came up with this very cheesy way. It's simple, but it works.

Output of mapply:

mapply(fun.seq, start_end$a, start_end$b)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11

[[2]]
[1] 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 23 24 25 26 27 28 29 30 31 32

[[4]]
 [1] 34 35 36 37 38 39 40 41 42 43 44 45 46
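
As for the data.table idea above, here is one possible sketch (untested on the real data; like the start_end construction, it assumes every 10 has a matching 70 later on). A row lies between a 10 and its closing 70 exactly when the running count of 10s exceeds the running count of 70s:

library(data.table)

dt <- as.data.table(test)
# keep rows inside an open 10..70 window, plus the closing 70 rows themselves
dt[cumsum(condition == 10) > cumsum(condition == 70) | condition == 70]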

Upvotes: 1

user3603486


Advanced, huh?

zoo::na.locf is your friend here:

condition <- as.numeric(c("10", "0", "0", "1", "0", "0", "0", "60",
      "0", "0", "70", "10", "0", "0", "1", "0", "60", "0", "0", "70",
      "0", "0", "10"))

dfr <- data.frame(condition = condition)

# mark each 70 as opening a "between" stretch and each 10 as closing it,
# then carry the last marker forward across the 0 rows in between
dfr$between <- NA
dfr$between[dfr$condition == 70] <- TRUE
dfr$between[dfr$condition == 10] <- FALSE
dfr$between <- zoo::na.locf(dfr$between, na.rm = FALSE)
dfr$between[dfr$condition == 70] <- FALSE # don't remove the 70s themselves

# drop the rows flagged as lying between a 70 and the next 10
dfr[!dfr$between, ]
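
One caveat (an assumption about data the example doesn't cover): if rows can occur before the first 10 or 70, their between flag stays NA after the fill, and indexing with NA would leak NA rows into the result. A defensive variant:

# also drop any leading rows whose between flag was never set
dfr[!is.na(dfr$between) & !dfr$between, ]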

Upvotes: 1
