Reputation: 149

R: subset dataframe for all rows after a condition is met

So I'm having a dataset of the following form:

ID    Var1   Var2
1      2      0
1      8      0
1      12     0
1      11     1
1      10     1
2      5      0
2      8      0
2      7      0
2      6      1
2      5      1

I would like to subset the dataframe and create a new dataframe, containing only the rows after Var1 first reached its group-maximum (including the row this happens) up to the row where Var2 becomes 1 for the first time (also including this row). So what I'd like to have should look like this:

ID    Var1   Var2
1      12     0
1      11     1
2      8      0
2      7      0
2      6      1

The original dataset contains a number of NAs and the function should simply ignore those. Also if Var2 never reaches "1" for a group is should just add all rows to the new dataframe (of course only the ones after Var1 reaches its group maximum). However I cannot wrap my hand around the programming. Does anyone know help?

Upvotes: 1

Answers (3)

Rui Barradas

Reputation: 76450

A dplyr solution with cumsum based filter will do what the question asks for.

library(dplyr)

df1 %>%
  group_by(ID) %>%
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups:   ID [2]
#     ID  Var1  Var2
#  <int> <int> <int>
#1     1    12     0
#2     1    11     1
#3     2     8     0
#4     2     7     0
#5     2     6     1

Edit

Here is a solution that tries to answer to the OP's comment and question edit.

df1 %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with('Var')), ~replace_na(., 0L)) %>%
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)

Data

df1 <- read.table(text = "
ID    Var1   Var2
1      2      0
1      8      0
1      12     0
1      11     1
1      10     1
2      5      0
2      8      0
2      7      0
2      6      1
2      5      1
", header = TRUE)

Upvotes: 3

akrun

Reputation: 887213

Using data.table with .I

library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
#   ID Var1 Var2
#1:  1   12    0
#2:  1   11    1
#3:  2    8    0
#4:  2    7    0
#5:  2    6    1

data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
    Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L), Var2 = c(0L, 
    0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame",
    row.names = c(NA, 
-10L))

Upvotes: 1

Vincent

Reputation: 17715

Here is data.table translation of Rui Barradas' working solution:

library(data.table)

dat <- fread(text = "
ID    Var1   Var2
1      2      0
1      8      0
1      12     0
1      11     1
1      10     1
2      5      0
2      8      0
2      7      0
2      6      1
2      5      1
", header = TRUE)

dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]

Upvotes: 0

R: subset dataframe for all rows after a condition is met

Answers (3)

data

Related Questions