akash87
akash87

Reputation: 3994

Filter between a Start point and Stop point

I have a dataset that looks like the following:

ID  Cond    Time1   Time2
1   2       Start   Stop1
1   3       Start   abc
1   1       abc     Stop2
1   2       Start   abc
1   2       abc     Stop1
2   2       Start   abc
2   4       abc     jkl
2   3       abc     jkl
2   2       abc     jkl
2   3       abc     Stop2
3   2       Start   abc
3   3       abc     Stop2
3   2       Start   Stop1
3   3       Start   Stop1
3   3       Start   abc
3   2       abc     jkl
3   4       baba    Stop1
4   2       Start   Stop2
4   1       Start   asd
4   2       abc     Stop2

And I need to filter the data based on a couple of criteria. If Cond = 2 and Time1 = Start, and I need to filter until the first stopping point (either Stop1 or Stop2). Essentially, it should look like this:

ID  Cond    Time1   Time2
1   2       Start   Stop1
1   2       Start   abc
1   2       abc     Stop1
2   2       Start   abc
2   4       abc     jkl
2   3       abc     jkl
2   2       abc     jkl
2   3       abc     Stop2
3   2       Start   abc
3   3       abc     Stop2
3   2       Start   Stop1
4   2       Start   Stop2

Also, the real dataset has over 140,000 observations, so efficienicy is key. I was thinking about using the dplyr package, but not sure how to go about this problem.

Upvotes: 3

Views: 118

Answers (3)

Frank
Frank

Reputation: 66819

Another data.table solution:

library(data.table)
setDT(DF)
DF[,     s0 := cumsum(Cond==2 & Time1 == "Start")]
DF[.N:1, s1 := cumsum(Time2 %like% "Stop")]

DF[, .SD[ s1 == s1[1L] ], by=s0]

    s0 ID Cond Time1 Time2 s1
 1:  1  1    2 Start Stop1 10
 2:  2  1    2 Start   abc  8
 3:  2  1    2   abc Stop1  8
 4:  3  2    2 Start   abc  7
 5:  3  2    4   abc   jkl  7
 6:  3  2    3   abc   jkl  7
 7:  3  2    2   abc   jkl  7
 8:  3  2    3   abc Stop2  7
 9:  4  3    2 Start   abc  6
10:  4  3    3   abc Stop2  6
11:  5  3    2 Start Stop1  5
12:  6  4    2 Start Stop2  2

.SD is the Subset of Data associated with each by=s0 group. The .N:1 in the second line reverses the data temporarily to create s1. If you don't want to keep the new columns around, they can be removed like DF[, s0 := NULL][, s1 := NULL] or DF[, c("s0", "s1") := NULL].

If the last line is slow, it is worth trying @eddi's approach:

DF[DF[, .I[ s1 == s1[1L] ], by=s0]$V1]

Upvotes: 2

mbiron
mbiron

Reputation: 4231

Using dplyr

dframe = read.table(header = T, text = "ID  Cond    Time1   Time2
1   2       Start   Stop1
                    1   3       Start   abc
                    1   1       abc     Stop2
                    1   2       Start   abc
                    1   2       abc     Stop1
                    2   2       Start   abc
                    2   4       abc     jkl
                    2   3       abc     jkl
                    2   2       abc     jkl
                    2   3       abc     Stop2
                    3   2       Start   abc
                    3   3       abc     Stop2
                    3   2       Start   Stop1
                    3   3       Start   Stop1
                    3   3       Start   abc
                    3   2       abc     jkl
                    3   4       baba    Stop1
                    4   2       Start   Stop2
                    4   1       Start   asd
                    4   2       abc     Stop2")

library(dplyr)

# add index
dframe = data.frame(index = 1:nrow(dframe), dframe)
head(dframe)

# get starting points
start_points = dframe %>%
  filter(Cond == 2 & Time1 == 'Start') %>%
  select(index, ID)

# get stopping points
stop_points = dframe %>%
  filter(substr(Time2, 1, 4) == 'Stop') %>%
  select(index, ID)

# get the stopping point associated with each start point
start_stop = start_points %>%
  left_join(stop_points, by = "ID") %>%
  filter(index.x <= index.y) %>%
  group_by(ID, index.x) %>%
  summarise(index.y = min(index.y)) %>%
  ungroup() %>%
  rename(start_index = index.x, stop_index = index.y)

# add rows between
result = start_stop %>%
  left_join(dframe, by = "ID") %>%
  filter(start_index <= index, index <= stop_index) %>%
  select(-c(start_index, stop_index, index))

> result
Source: local data frame [12 x 4]

ID  Cond  Time1  Time2
(int) (int) (fctr) (fctr)
1      1     2  Start  Stop1
2      1     2  Start    abc
3      1     2    abc  Stop1
4      2     2  Start    abc
5      2     4    abc    jkl
6      2     3    abc    jkl
7      2     2    abc    jkl
8      2     3    abc  Stop2
9      3     2  Start    abc
10     3     3    abc  Stop2
11     3     2  Start  Stop1
12     4     2  Start  Stop2

Upvotes: 2

akuiper
akuiper

Reputation: 214937

You can use Map to conditionally construct the series of rows you want to select, where an anonymous function can be used to judge if the start time has condition 2. Here is a solution, where we used the data.table for syntax sugar:

library(data.table)
setDT(df)
df[unlist(Map(function(t1, t2) if(t1 %in% which(Cond == 2)) t1:t2 else NULL, 
              which(Time1 == "Start"), which(grepl("Stop", Time2))))]
    ID Cond Time1 Time2
 1:  1    2 Start Stop1
 2:  1    2 Start   abc
 3:  1    2   abc Stop1
 4:  2    2 Start   abc
 5:  2    4   abc   jkl
 6:  2    3   abc   jkl
 7:  2    2   abc   jkl
 8:  2    3   abc Stop2
 9:  3    2 Start   abc
10:  3    3   abc Stop2
11:  3    2 Start Stop1
12:  4    2 Start Stop2

To boost the performance a little bit:

df[{ cond2 = which(Cond == 2); 
     unlist(Map(function(t1, t2) if(t1 %in% cond2) t1:t2 else NULL, 
                which(Time1 == "Start"), which(grepl("Stop", Time2)))) }]

Upvotes: 1

Related Questions