Reputation: 507
Problem:
I have a dataset with multiple grouping variables and for each group I need to select rows from the first instance where the variable value
is >=5
.
Original data look like this:
id type time value
1: 1 1 1 1.002
2: 1 1 2 4.019
3: 1 1 3 5.048
4: 1 1 4 6.005
5: 1 1 5 4.108
6: 1 1 6 3.509
7: 1 2 1 2.104
8: 1 2 2 6.001
9: 1 2 3 5.903
10: 1 2 4 5.025
11: 1 2 5 3.907
12: 1 2 6 4.569
13: 5 1 1 4.006
14: 5 1 2 4.019
15: 5 1 3 4.908
16: 5 1 4 6.001
17: 5 1 5 4.199
18: 5 1 6 4.999
19: 5 2 1 0.009
20: 5 2 2 2.093
21: 5 2 3 3.081
22: 5 2 4 4.014
23: 5 2 5 4.998
24: 5 2 6 5.041
Possible solution:
In order to use the accepted dplyr
answer in this question, I added a logical variable to help me select rows and applied the filter:
sample.dt$state <- FALSE
sample.dt$state[sample.dt$value >=5] <- TRUE
sample.dt%>%
group_by(id, type)%>%
filter(cumsum(state)>0)
which does give me what I need:
id type time value state
<dbl> <dbl> <dbl> <dbl> <lgl>
1 1 1 3 5.048 TRUE
2 1 1 4 6.005 TRUE
3 1 1 5 4.108 FALSE
4 1 1 6 3.509 FALSE
5 1 2 2 6.001 TRUE
6 1 2 3 5.903 TRUE
7 1 2 4 5.025 TRUE
8 1 2 5 3.907 FALSE
9 1 2 6 4.569 FALSE
10 5 1 4 6.001 TRUE
11 5 1 5 4.199 FALSE
12 5 1 6 4.999 FALSE
13 5 2 6 5.041 TRUE
Question:
What is a better or more direct way of doing this? As I'll be applying it to a quite large data set with more nested grouping variables, I'd rather not have to create the logical variable to do this.
Sample data:
sample.dt <- data.table(id = c(1,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5),
type = c(1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2),
time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
value = c(1.002,4.019,5.048,6.005,4.108,3.509,
2.104,6.001,5.903,5.025,3.907,4.569,
4.006,4.019,4.908,6.001,4.199,4.999,
0.009,2.093,3.081,4.014,4.998,5.041))
Upvotes: 1
Views: 889
Reputation: 3492
> sample.dt$var=ifelse(sample.dt$value>=5,TRUE,FALSE)
> sample.dt
id type time value var
1: 1 1 1 1.002 FALSE
2: 1 1 2 4.019 FALSE
3: 1 1 3 5.048 TRUE
4: 1 1 4 6.005 TRUE
5: 1 1 5 4.108 FALSE
6: 1 1 6 3.509 FALSE
7: 1 2 1 2.104 FALSE
8: 1 2 2 6.001 TRUE
9: 1 2 3 5.903 TRUE
10: 1 2 4 5.025 TRUE
11: 1 2 5 3.907 FALSE
12: 1 2 6 4.569 FALSE
13: 5 1 1 4.006 FALSE
14: 5 1 2 4.019 FALSE
15: 5 1 3 4.908 FALSE
16: 5 1 4 6.001 TRUE
17: 5 1 5 4.199 FALSE
18: 5 1 6 4.999 FALSE
19: 5 2 1 0.009 FALSE
20: 5 2 2 2.093 FALSE
21: 5 2 3 3.081 FALSE
22: 5 2 4 4.014 FALSE
23: 5 2 5 4.998 FALSE
24: 5 2 6 5.041 TRUE
> min(which(sample.dt$var== TRUE))
[1] 3
sample.dt[min(which(sample.dt$var== TRUE)),,]
id type time value var
1: 1 1 3 5.048 TRUE
or just
> sample.dt[min(which(ifelse(sample.dt$value>=5,TRUE,FALSE)== TRUE)),,]
id type time value var
1: 1 1 3 5.048 TRUE
Upvotes: 0
Reputation: 887048
As the initial dataset is data.table
, we can use data.table
approaches
sample.dt[, .SD[cumsum(value >=5) > 0] , by = .(id, type)]
A faster approach would be to extract the row index (.I
) and subset
sample.dt[sample.dt[, .I[cumsum(value >=5) > 0] , by = .(id, type)]$V1]
Upvotes: 2