Olivia

Reputation: 814

R deleting rows based on condition (advanced)

I have some very large data frames. I created a conditional column that takes the value 10, 1, 60 or 70 depending on some specific values, with everything in between set to 0. I then built a small data frame holding every 10 and 70 position (the start and end of each sequence I wanted) and used it to subset my data frames so that I was left with only what lies between each 10 and 70 (including the 1, 60 and 0 rows between them).
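
The setup step presumably looked something like this (a minimal sketch; the value column and the four tests are illustrative assumptions, not the actual code used):

# sketch only: "value" and the four tests stand in for whatever the
# real data uses to define the 10 / 1 / 60 / 70 markers
df$condition = ifelse(df$value == "start", 10,
               ifelse(df$value == "event", 1,
               ifelse(df$value == "pre_end", 60,
               ifelse(df$value == "end", 70, 0))))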

The resulting data frame is almost the subset I require and the conditional column now looks as follows:

    condition
           10
            0
            0
            1
            0
            0
            0
            60
            0
            0
            70
            10
            0
            0
            1
            0
            60
            0
            0
            70
            0
            0
            10
            0
            0
            1
            0
            0
            60
            0
            0
            70
            0
            10
            0
            0
            1
            0
            0
            0
            0
            60
            0
            0
            0
            70

After each 70 I should get a 10 again, and each cycle repeats. However, when I set this process up to loop over multiple files overnight, the results included some unwanted data. As you can see above, there are rows between a 70 and the following 10. I'd like to eliminate all the 0 rows that sit between a 70 and the next 10. In this example there are only a few rows between the non-zero values, but my actual data has many more, so it's not trivial.

So the resulting data frame from the above would be:

condition
       10
        0
        0
        1
        0
        0
        0
        60
        0
        0
        70
        10
        0
        0
        1
        0
        60
        0
        0
        70
        10
        0
        0
        1
        0
        0
        60
        0
        0
        70
        10
        0
        0
        1
        0
        0
        0
        0
        60
        0
        0
        0
        70

These data frames range from 1 to 10 million rows, so I don't think a loop that visits each row in turn would be efficient. Any ideas?

UPDATE

I thought I had a solution using the same methodology as I'd already used to get here.

I tried:

library(data.table) # for shift()

test$t = NA
# flag the row immediately after a 70 when it isn't a 10 (start of a 'bad' stretch) ...
test$t = ifelse(shift(test$ff) == 70 & test$ff != 10, 1, test$t)
# ... and the row immediately before a 10 when it isn't a 70 (end of a 'bad' stretch)
test$t = ifelse(shift(test$ff, type = "lead") == 10 & test$ff != 70, 1, test$t)
test2 = data.frame(s = test$t[c(T,F)], e = test$t[c(F,T)])
Error in data.frame(s = test$t[c(T, F)], e = test$t[c(F, T)]) : 
  arguments imply differing number of rows: 44085, 44084

My idea was to find the row after each 'bad' 70 and the row before each 'bad' 10, then delete all the positions between them using test2. It failed because my data also has cases like:

70
0
0
60
0
0
70
10

Here there is no paired 'bad' 10 for the 70. Thinking about it, instead of trying to remove what's bad, I could subset again with my original method of only keeping sequences from 10 to 70, and that might solve it. The way I did that was as follows:

df2 = df[df$ff == 10 | df$ff == 70, ]
# pair alternating rows: odd rows are 10s (starts), even rows 70s (ends)
test = data.frame(s = df2$datetime[c(T,F)], e = df2$datetime[c(F,T)])
# look up the row index of each start and each end in the full data
for(i in 1:length(test$s)){test$ss[i] = which(df$datetime == test$s[i])}
for(i in 1:length(test$s)){test$ee[i] = which(df$datetime == test$e[i])}
# flag every row from each start through its end, then keep the flagged rows
df$t = NA
for(i in 1:length(test$s)){df$t[test$ss[i]:test$ee[i]] = 1}
df_all = df[which(df$t == 1), ]
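
On data this size, the two which() loops and the flagging loop can also be vectorized. A sketch of the same method (assuming each datetime value is unique, as the which() lookups above already require):

ss = match(test$s, df$datetime) # row index of each start
ee = match(test$e, df$datetime) # row index of each end
# expand every start:end pair into row indices and subset once
df_all = df[unlist(Map(seq, ss, ee)), ]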

Upvotes: 1

Views: 268

Answers (2)

SunLisa

Reputation: 134

Assuming your condition data.frame is called test:

start_end <- data.frame(a = which(test$condition == 10), b = which(test$condition == 70))

This gives a data.frame that marks each start and each end.

Next, write a function that generates the sequence of row indices from a start to an end:

fun.seq <- function(a, b) {
  return(seq(a, b, 1))
}

mapply that function over each row of start_end, unlist the result, and use it to select the rows:

test[unlist(mapply(fun.seq, start_end$a, start_end$b)), ]
 [1] 10  0  0  1  0  0  0 60  0  0 70 10  0  0  1  0 60  0  0 70 10  0  0  1  0  0 60  0  0 70 10  0  0  1  0  0  0  0 60  0  0  0 70
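
Since test has a single column, the [ subset drops the result to a vector, which is why the output above prints as one. To keep a data.frame, add drop = FALSE:

test[unlist(mapply(fun.seq, start_end$a, start_end$b)), , drop = FALSE]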

I am not sure whether it would be fast enough on millions of rows.

Maybe data.table could do this more quickly, but I'm not sure how; a possible sketch follows the mapply output below.

I came up with this very cheesy way. It's simple, but it works.

Output of mapply:

mapply(fun.seq, start_end$a, start_end$b)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10 11

[[2]]
[1] 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 23 24 25 26 27 28 29 30 31 32

[[4]]
 [1] 34 35 36 37 38 39 40 41 42 43 44 45 46
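
As for the data.table idea above, here is one possible sketch (untested on the real data; like the start_end construction, it assumes every 10 has a matching 70 later on). A row lies between a 10 and its closing 70 exactly when the running count of 10s exceeds the running count of 70s:

library(data.table)

dt <- as.data.table(test)
# keep rows inside an open 10..70 window, plus the closing 70 rows themselves
dt[cumsum(condition == 10) > cumsum(condition == 70) | condition == 70]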

Upvotes: 1

user3603486


Advanced, huh?

zoo::na.locf is your friend here:

condition <- as.numeric(c("10", "0", "0", "1", "0", "0", "0", "60",
      "0", "0", "70", "10", "0", "0", "1", "0", "60", "0", "0", "70",
      "0", "0", "10"))

dfr <- data.frame(condition = condition)

# mark each 70 as opening a "between" stretch and each 10 as closing it,
# then carry the last marker forward across the 0 rows in between
dfr$between <- NA
dfr$between[dfr$condition == 70] <- TRUE
dfr$between[dfr$condition == 10] <- FALSE
dfr$between <- zoo::na.locf(dfr$between, na.rm = FALSE)
dfr$between[dfr$condition == 70] <- FALSE # don't remove the 70s themselves

# drop the rows flagged as lying between a 70 and the next 10
dfr[!dfr$between, ]
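
One caveat (an assumption about data the example doesn't cover): if rows can occur before the first 10 or 70, their between flag stays NA after the fill, and indexing with NA would leak NA rows into the result. A defensive variant:

# also drop any leading rows whose between flag was never set
dfr[!is.na(dfr$between) & !dfr$between, ]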

Upvotes: 1
