Remko Duursma

Reputation: 2821

Efficient way to filter only first rows where condition is met?

Suppose I have a data frame like this:

library(dplyr)

data <- tibble(
   label = c("a","a","b","a","c","c","a")
)
data$index <- 1:nrow(data)

I don't want to subset all the rows where label == "a", but only the first consecutive run of rows where this is true.

In the example, I would want the first two rows:

  label index
  <chr> <int>
1 a         1
2 a         2

because in the next row the label is "b". All subsequent rows where label == "a" should be ignored.

I have implemented an ugly solution with a for loop, but surely there is an efficient way to filter like this?

Upvotes: 3

Views: 1246

Answers (4)

akrun

Reputation: 887118

Another option is to compare the column with its lag, build a numeric run index with cumsum, and convert that index to a logical vector for filter:

library(dplyr)
data %>% 
      filter(cumsum(label != lag(label, default = first(label))) < 1)
# A tibble: 2 x 2
#  label index
#  <chr> <int>
#1 a         1
#2 a         2
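
To make the intermediate steps of the lag/cumsum idea visible, here is a minimal base R sketch on the same labels. The variable names (prev, changed, run_id, keep) are made up for illustration and are not part of the answer above.

```r
# Same labels as in the question, as a plain character vector
label <- c("a", "a", "b", "a", "c", "c", "a")

# Previous value of each element; the first element is compared with itself,
# which mirrors lag(label, default = first(label))
prev <- c(label[1], head(label, -1))

# TRUE wherever the label differs from the one before it
changed <- label != prev

# Running count of changes: 0 for the first run, 1 for the second, and so on
run_id <- cumsum(changed)

# Keep only the elements that come before the first change
keep <- run_id < 1
label[keep]
```

Note that this keeps the first run whatever its label is; it only matches the question because the data happen to start with "a".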

Upvotes: 0

Yuriy Saraykin

Reputation: 8880

You can use:

data %>% 
  filter(data.table::rleid(label) == 1)

# A tibble: 2 x 2
  label index
  <chr> <int>
1 a         1
2 a         2

Upvotes: 2

Karthik S

Reputation: 11584

If you want to use just rle:

library(dplyr)
data %>% filter(rep(seq_along(rle(label)$values), rle(label)$lengths) == 1)
# A tibble: 2 x 2
  label index
  <chr> <int>
1 a         1
2 a         2
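
For readers unfamiliar with rle, a small base R sketch of what it returns on these labels may help. It calls rle once and stores the result; the names runs, run_id and first_len are illustrative assumptions, not part of the answer above.

```r
# Same labels as in the question
label <- c("a", "a", "b", "a", "c", "c", "a")

# Run-length encoding: one entry per run of identical consecutive values
runs <- rle(label)
runs$values   # the label of each run
runs$lengths  # how many elements each run contains

# Expand back to one run number per element (what rleid computes directly)
run_id <- rep(seq_along(runs$values), runs$lengths)

# The first run is simply the first runs$lengths[1] elements
first_len <- runs$lengths[1]
head(label, first_len)
```

Since only the length of the first run is needed, head(data, rle(data$label)$lengths[1]) would be an equivalent base R way to get the same two rows.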

Upvotes: 1

tmfmnk

Reputation: 39858

One option could be:

data %>%
 slice_max(label == "a", n = 2, with_ties = FALSE)

  label index
  <chr> <int>
1 a         1
2 a         2

However, it may generate unexpected results when n is larger than the number of matching rows, because the remaining slots are then filled with non-matching rows. A solution to overcome this issue:

data %>%
 slice(head(which(label == "c"), 3))

  label index
  <chr> <int>
1 c         5
2 c         6
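
Both slice variants pick rows by position among the matches, so they need n to be known in advance. As a rough sketch of how the first consecutive run of a given label could be located without knowing its length, the hypothetical helper first_run_where() below (not a dplyr or base R function) returns the row positions of that run; it assumes the vector contains no NA values.

```r
# Hypothetical helper: positions of the first consecutive run of `target` in `x`
first_run_where <- function(x, target) {
  hit <- x == target
  start <- match(TRUE, hit)               # position of the first match, NA if none
  if (is.na(start)) return(integer(0))    # no match at all
  # Positions after `start` where the condition stops holding
  after <- which(!hit & seq_along(x) > start)
  end <- if (length(after)) after[1] - 1L else length(x)
  seq.int(start, end)
}

label <- c("a", "a", "b", "a", "c", "c", "a")
first_run_where(label, "a")
first_run_where(label, "c")
```

With the tibble from the question, the result could be passed to slice, for example data %>% slice(first_run_where(label, "c")).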

Upvotes: 1
