ant
ant

Reputation: 585

Removing all rows under a specified row in a time series

I'm trying to analyze game data but I need to remove of all rows after a specified row.

In the following case I want to remove all rows after the EVENT "Die" for each users. Data is sorted by UID, TIME.HOUR.

df:

 UID  TIME.HOUR  EVENT
  1      5       Run
  1      5       Run
  1      6       Run
  1      7       Die
  1      8       Run
  1      9       Run
  2      14      Jump
  2      15      Die
  2      16      Run
  2      17      Run

Expected result:

 UID  TIME.HOUR  EVENT
  1      5       Run
  1      5       Run
  1      6       Run
  1      7       Die
  2      14      Jump
  2      15      Die

I think i'm on the right track with the code below but don't struggling with the next step.

 args <- which(df$EVENT== "Die")
 df[,c(sapply(args, function(x) ???), by = UID] #seq? range? 

Thank you.

Upvotes: 3

Views: 62

Answers (3)

akrun
akrun

Reputation: 887391

We can use data.table. Convert the 'data.frame' to 'data.table', grouped by 'UID', get a double cumsum of logical vector (EVENT == "Die"), check whether it is less than 2 to Subset the Data.table (.SD)

library(data.table)
setDT(df)[, .SD[cumsum(cumsum(EVENT=="Die"))<2] , UID]
#   UID TIME.HOUR EVENT
#1:   1         5   Run
#2:   1         5   Run
#3:   1         6   Run
#4:   1         7   Die
#5:   2        14  Jump
#6:   2        15   Die

Or a faster approach: to get the row index, extract that column ($V1) to subset the data

setDT(df)[df[, .I[cumsum(cumsum(EVENT=="Die"))<2] , UID]$V1]

Or a modification of @Psidom's approach

setDT(df)[df[, .I[seq(match("Die", EVENT, nomatch = .N))] , UID]$V1]

Or use dplyr

library(dplyr)
df %>%
   group_by(UID) %>% 
   slice(seq(match("Die", EVENT, nomatch = n())))
#    UID TIME.HOUR EVENT
#  <int>     <int> <chr>
#1     1         5   Run
#2     1         5   Run
#3     1         6   Run
#4     1         7   Die
#5     2        14  Jump
#6     2        15   Die

In case, we need a data.frame output, chain with %>% as.data.frame (from @R.S. comments)

Upvotes: 4

Frank
Frank

Reputation: 66819

This probably isn't so efficient, but you could do a fancy join:

mdf = df[EVENT == "Die", head(.SD, 1L), by=UID, .SDcols = "TIME.HOUR"]
df[!mdf, on=.(UID, TIME.HOUR > TIME.HOUR)]

   UID TIME.HOUR EVENT
1:   1         5   Run
2:   1         5   Run
3:   1         6   Run
4:   1         7   Die
5:   2        14  Jump
6:   2        15   Die

You don't actually need to save the mdf table as a separate object, of course.


How it works

  • x[!i], where i is another data.table or list of vectors, is an anti-join, telling R to exclude rows of x based on i, similar to how it works with vectors (where i would have to be a logical vector).

  • The on=.(ID, v >= v) option tells R that we're doing a "non-equi join." The v >= v part means that the v from i (on the left-hand side) should be greater than the v from x (on the right-hand side).

Combining these two, we're excluding rows that meet the criteria specified in the on=.


Side notes. There are a couple things I'm not sure about: Do we have a better name than non-equi join? Why is the v on the left from i even though x[i] has x to the left of i?

I borrowed from both Psidom and akrun's answer, using head and an inequality. One (maybe?) advantage here is that head(.SD, 1L) is optimized while head(.SD, expr) is not yet.

Upvotes: 3

akuiper
akuiper

Reputation: 215047

Another option, you can use head() with match()(find the first Die index):

dt[, head(.SD, match("Die", EVENT, nomatch = .N)), UID]   # if no match is found within the 
                                                          # group, return the whole group

#   UID TIME.HOUR EVENT
#1:   1         5   Run
#2:   1         5   Run
#3:   1         6   Run
#4:   1         7   Die
#5:   2        14  Jump
#6:   2        15   Die

Upvotes: 2

Related Questions