Reputation: 3994
I have a dataset that looks like the following:
ID Cond Time1 Time2
1 2 Start Stop1
1 3 Start abc
1 1 abc Stop2
1 2 Start abc
1 2 abc Stop1
2 2 Start abc
2 4 abc jkl
2 3 abc jkl
2 2 abc jkl
2 3 abc Stop2
3 2 Start abc
3 3 abc Stop2
3 2 Start Stop1
3 3 Start Stop1
3 3 Start abc
3 2 abc jkl
3 4 baba Stop1
4 2 Start Stop2
4 1 Start asd
4 2 abc Stop2
I need to filter the data based on a couple of criteria: a run of rows begins where Cond = 2 and Time1 = Start, and continues until the first stopping point (either Stop1 or Stop2) is reached. Essentially, it should look like this:
ID Cond Time1 Time2
1 2 Start Stop1
1 2 Start abc
1 2 abc Stop1
2 2 Start abc
2 4 abc jkl
2 3 abc jkl
2 2 abc jkl
2 3 abc Stop2
3 2 Start abc
3 3 abc Stop2
3 2 Start Stop1
4 2 Start Stop2
Also, the real dataset has over 140,000 observations, so efficiency is key. I was thinking about using the dplyr package, but I'm not sure how to go about this problem.
Upvotes: 3
Views: 118
Reputation: 66819
Another data.table solution:
library(data.table)
setDT(DF)
DF[, s0 := cumsum(Cond == 2 & Time1 == "Start")]  # running count of start rows = group id
DF[.N:1, s1 := cumsum(Time2 %like% "Stop")]       # same trick bottom-up, counting stop rows
DF[, .SD[s1 == s1[1L]], by = s0]
s0 ID Cond Time1 Time2 s1
1: 1 1 2 Start Stop1 10
2: 2 1 2 Start abc 8
3: 2 1 2 abc Stop1 8
4: 3 2 2 Start abc 7
5: 3 2 4 abc jkl 7
6: 3 2 3 abc jkl 7
7: 3 2 2 abc jkl 7
8: 3 2 3 abc Stop2 7
9: 4 3 2 Start abc 6
10: 4 3 3 abc Stop2 6
11: 5 3 2 Start Stop1 5
12: 6 4 2 Start Stop2 2
.SD is the Subset of Data associated with each by=s0 group. The .N:1 in the second line temporarily reverses the data to create s1. If you don't want to keep the new columns around, they can be removed with DF[, s0 := NULL][, s1 := NULL] or DF[, c("s0", "s1") := NULL].
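The grouping idea behind s0 can be seen in isolation: cumsum() over a logical vector increments at each TRUE, so every row gets the id of the most recent start row. A minimal sketch with a made-up vector:

```r
# TRUE marks rows where a new group begins (hypothetical data)
starts <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
cumsum(starts)
# [1] 1 1 2 2 2  -- rows sharing a number fall under the same start
```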
If the last line is slow, it is worth trying @eddi's approach:
DF[DF[, .I[ s1 == s1[1L] ], by=s0]$V1]
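The .I idiom used there returns global row numbers from each group, which are then used for a single subset; a toy sketch (the table and column names are made up):

```r
library(data.table)
dt <- data.table(g = c(1, 1, 2, 2), x = c(5, 9, 3, 7))
# .I yields the original row indices of the max-x row per group
idx <- dt[, .I[x == max(x)], by = g]$V1
dt[idx]
#    g x
# 1: 1 9
# 2: 2 7
```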
Upvotes: 2
Reputation: 4231
Using dplyr:
dframe = read.table(header = T, text = "ID Cond Time1 Time2
1 2 Start Stop1
1 3 Start abc
1 1 abc Stop2
1 2 Start abc
1 2 abc Stop1
2 2 Start abc
2 4 abc jkl
2 3 abc jkl
2 2 abc jkl
2 3 abc Stop2
3 2 Start abc
3 3 abc Stop2
3 2 Start Stop1
3 3 Start Stop1
3 3 Start abc
3 2 abc jkl
3 4 baba Stop1
4 2 Start Stop2
4 1 Start asd
4 2 abc Stop2")
library(dplyr)
# add index
dframe = data.frame(index = 1:nrow(dframe), dframe)
head(dframe)
# get starting points
start_points = dframe %>%
  filter(Cond == 2 & Time1 == 'Start') %>%
  select(index, ID)
# get stopping points
stop_points = dframe %>%
  filter(substr(Time2, 1, 4) == 'Stop') %>%
  select(index, ID)
# get the stopping point associated with each start point
start_stop = start_points %>%
  left_join(stop_points, by = "ID") %>%
  filter(index.x <= index.y) %>%
  group_by(ID, index.x) %>%
  summarise(index.y = min(index.y)) %>%
  ungroup() %>%
  rename(start_index = index.x, stop_index = index.y)
# add rows between
result = start_stop %>%
  left_join(dframe, by = "ID") %>%
  filter(start_index <= index, index <= stop_index) %>%
  select(-c(start_index, stop_index, index))
> result
Source: local data frame [12 x 4]
ID Cond Time1 Time2
(int) (int) (fctr) (fctr)
1 1 2 Start Stop1
2 1 2 Start abc
3 1 2 abc Stop1
4 2 2 Start abc
5 2 4 abc jkl
6 2 3 abc jkl
7 2 2 abc jkl
8 2 3 abc Stop2
9 3 2 Start abc
10 3 3 abc Stop2
11 3 2 Start Stop1
12 4 2 Start Stop2
Upvotes: 2
Reputation: 214937
You can use Map to conditionally construct the series of rows you want to select, with an anonymous function that checks whether the start row has Cond equal to 2. Here is a solution that uses data.table for syntactic sugar:
library(data.table)
setDT(df)
df[unlist(Map(function(t1, t2) if (t1 %in% which(Cond == 2)) t1:t2 else NULL,
              which(Time1 == "Start"), which(grepl("Stop", Time2))))]
ID Cond Time1 Time2
1: 1 2 Start Stop1
2: 1 2 Start abc
3: 1 2 abc Stop1
4: 2 2 Start abc
5: 2 4 abc jkl
6: 2 3 abc jkl
7: 2 2 abc jkl
8: 2 3 abc Stop2
9: 3 2 Start abc
10: 3 3 abc Stop2
11: 3 2 Start Stop1
12: 4 2 Start Stop2
To boost the performance a little bit:
df[{ cond2 = which(Cond == 2);
     unlist(Map(function(t1, t2) if (t1 %in% cond2) t1:t2 else NULL,
                which(Time1 == "Start"), which(grepl("Stop", Time2)))) }]
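The Map()/unlist() pattern here just stitches together one start:stop run per pair of indices; a small sketch with made-up index vectors:

```r
# hypothetical row indices of paired starts and stops
starts <- c(1, 6)
stops  <- c(3, 8)
unlist(Map(seq, starts, stops))
# [1] 1 2 3 6 7 8
```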
Upvotes: 1