Reputation: 2126
I have a lot of data frames, each with several columns. Two of these columns are time and value.
Minimal example
library(tidyverse)
df <- approx(seq(1, 10, 1), c(1, 5, 7, 11, 4, 12, 30, 20, 10, 9)) %>%
  as.data.frame() %>%
  rename(time = x, value = y)
Goal
I want to remove all rows from each data frame, from the first row where value > 10 onward.
When the data frame contains values > 10, a solution would be the following:
df <- df %>%
  filter(row_number() <= first(which(value > 10)) - 1)
However, there are also data frames where the value does not exceed 10, e.g.,
df <- approx(seq(1, 10, 1), c(1, 5, 7, 1, 4, 2, 1, 2, 1, 9)) %>%
  as.data.frame() %>%
  rename(time = x, value = y)
In this case, the data frame should not be filtered (because the value threshold is not reached). When I use the filter solution from above, however, it returns an empty data frame.
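The empty result can be traced step by step. The following is a minimal sketch, assuming a current dplyr version in which first() of an empty vector returns NA; the names idx, cutoff and n_kept are only illustrative.

```r
library(dplyr)

# Data frame in which value never exceeds 10 (same construction as above)
df <- approx(seq(1, 10, 1), c(1, 5, 7, 1, 4, 2, 1, 2, 1, 9)) %>%
  as.data.frame() %>%
  rename(time = x, value = y)

idx <- which(df$value > 10)  # integer(0): no row exceeds the threshold
cutoff <- first(idx) - 1     # first() of an empty vector is NA, so cutoff is NA

# Comparing with NA gives NA for every row, and filter() drops NA rows
n_kept <- df %>%
  filter(row_number() <= cutoff) %>%
  nrow()
n_kept
```

So the filter condition is NA for every row rather than TRUE, which is why nothing is kept.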
Question
How would you solve this problem inside a dplyr pipe? Is it possible to do conditional filtering?
Upvotes: 0
Views: 3456
Reputation: 388807
You could write a conditional statement in filter:
library(dplyr)
df %>%
  filter(if (any(value > 10)) row_number() <= which.max(value > 10) - 1 else TRUE)
Writing the same logic in slice:
df %>%
  slice(if (any(value > 10)) seq_len(which.max(value > 10) - 1) else seq_len(n()))
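Since the question mentions many data frames, the conditional filter could be wrapped in a small function and applied to a list. This is only a sketch: trim_at_threshold and make_df are hypothetical helper names, the threshold argument is an assumption, and the code assumes the value column contains no NA.

```r
library(dplyr)

# Hypothetical helper: keep the rows before the first value above `threshold`;
# return the data frame unchanged if the threshold is never exceeded.
# Assumes `value` has no NA (any() could then return NA and break the if).
trim_at_threshold <- function(data, threshold = 10) {
  data %>%
    filter(if (any(value > threshold)) {
      row_number() < which.max(value > threshold)
    } else {
      TRUE
    })
}

# Hypothetical helper that builds the example data frames from the question
make_df <- function(vals) {
  approx(seq(1, 10, 1), vals) %>%
    as.data.frame() %>%
    rename(time = x, value = y)
}

dfs <- list(
  exceeds = make_df(c(1, 5, 7, 11, 4, 12, 30, 20, 10, 9)),
  below   = make_df(c(1, 5, 7, 1, 4, 2, 1, 2, 1, 9))
)

# Apply the same conditional filter to every data frame in the list
trimmed <- lapply(dfs, trim_at_threshold)
sapply(trimmed, nrow)
```

The data frame that never reaches the threshold comes back with all of its rows, while the other one is cut just before the first value above 10.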
Microbenchmarking
In terms of speed, there isn't a large difference between filter and slice; in the run below, filter is slightly faster (median of about 586 vs. 654 microseconds):
# approx() returns columns x and y; rename them so that `value` exists
df <- approx(seq(1, 10^5, 1),
             round(runif(10^5, min = 1, max = 10^10))) %>%
  as.data.frame() %>%
  rename(time = x, value = y)
library(microbenchmark)
microbenchmark(
  filter = df %>% filter(if (any(value > 10)) row_number() <= which.max(value > 10) - 1 else TRUE),
  slice = df %>% slice(if (any(value > 10)) seq_len(which.max(value > 10) - 1) else seq_len(n())),
  times = 10000)
Unit: microseconds
   expr     min       lq     mean   median       uq      max neval
 filter 551.522 570.2715 655.7250 586.3530 621.5590 13575.81 10000
  slice 614.276 633.6840 735.0398 654.2455 695.3795 14123.43 10000
Upvotes: 2