select rows from first instance of a value by group

Question

Problem:

I have a dataset with multiple grouping variables and for each group I need to select rows from the first instance where the variable value is >=5.

Original data look like this:

   id type time value
 1:  1    1    1 1.002
 2:  1    1    2 4.019
 3:  1    1    3 5.048
 4:  1    1    4 6.005
 5:  1    1    5 4.108
 6:  1    1    6 3.509
 7:  1    2    1 2.104
 8:  1    2    2 6.001
 9:  1    2    3 5.903
10:  1    2    4 5.025
11:  1    2    5 3.907
12:  1    2    6 4.569
13:  5    1    1 4.006
14:  5    1    2 4.019
15:  5    1    3 4.908
16:  5    1    4 6.001
17:  5    1    5 4.199
18:  5    1    6 4.999
19:  5    2    1 0.009
20:  5    2    2 2.093
21:  5    2    3 3.081
22:  5    2    4 4.014
23:  5    2    5 4.998
24:  5    2    6 5.041

Possible solution:

In order to use the accepted dplyr answer in this question, I added a logical variable to help me select rows and applied the filter:

sample.dt$state <- FALSE
sample.dt$state[sample.dt$value >=5] <- TRUE

sample.dt%>%
  group_by(id, type)%>%
  filter(cumsum(state)>0)

which does give me what I need:

       id  type  time value state
        
 1      1     1     3 5.048  TRUE
 2      1     1     4 6.005  TRUE
 3      1     1     5 4.108 FALSE
 4      1     1     6 3.509 FALSE
 5      1     2     2 6.001  TRUE
 6      1     2     3 5.903  TRUE
 7      1     2     4 5.025  TRUE
 8      1     2     5 3.907 FALSE
 9      1     2     6 4.569 FALSE
 10     5     1     4 6.001  TRUE
 11     5     1     5 4.199 FALSE
 12     5     1     6 4.999 FALSE
 13     5     2     6 5.041  TRUE

Question:

What is a better or more direct way of doing this? As I'll be applying it to a quite large data set with more nested grouping variables, I'd rather not have to create the logical variable to do this.

Sample data:

 sample.dt <- data.table(id = c(1,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5),
                    type = c(1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2), 
                    time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
                    value = c(1.002,4.019,5.048,6.005,4.108,3.509,
                              2.104,6.001,5.903,5.025,3.907,4.569,
                              4.006,4.019,4.908,6.001,4.199,4.999,
                              0.009,2.093,3.081,4.014,4.998,5.041))

akrun · Accepted Answer

As the initial dataset is data.table, we can use data.table approaches

sample.dt[, .SD[cumsum(value >=5) > 0] , by = .(id, type)]

A faster approach would be to extract the row index (.I) and subset

sample.dt[sample.dt[, .I[cumsum(value >=5) > 0] , by = .(id, type)]$V1]

select rows from first instance of a value by group

Answers (2)

Related Questions