Reputation: 585
I'm trying to analyze game data but I need to remove of all rows after a specified row.
In the following case I want to remove all rows after the EVENT "Die" for each users. Data is sorted by UID, TIME.HOUR.
df:
UID TIME.HOUR EVENT
1 5 Run
1 5 Run
1 6 Run
1 7 Die
1 8 Run
1 9 Run
2 14 Jump
2 15 Die
2 16 Run
2 17 Run
Expected result:
UID TIME.HOUR EVENT
1 5 Run
1 5 Run
1 6 Run
1 7 Die
2 14 Jump
2 15 Die
I think i'm on the right track with the code below but don't struggling with the next step.
args <- which(df$EVENT== "Die")
df[,c(sapply(args, function(x) ???), by = UID] #seq? range?
Thank you.
Upvotes: 3
Views: 62
Reputation: 887391
We can use data.table
. Convert the 'data.frame' to 'data.table', grouped by 'UID', get a double cumsum
of logical vector (EVENT == "Die"
), check whether it is less than 2 to Subset the Data.table (.SD
)
library(data.table)
setDT(df)[, .SD[cumsum(cumsum(EVENT=="Die"))<2] , UID]
# UID TIME.HOUR EVENT
#1: 1 5 Run
#2: 1 5 Run
#3: 1 6 Run
#4: 1 7 Die
#5: 2 14 Jump
#6: 2 15 Die
Or a faster approach: to get the row index, extract that column ($V1
) to subset the data
setDT(df)[df[, .I[cumsum(cumsum(EVENT=="Die"))<2] , UID]$V1]
Or a modification of @Psidom's approach
setDT(df)[df[, .I[seq(match("Die", EVENT, nomatch = .N))] , UID]$V1]
Or use dplyr
library(dplyr)
df %>%
group_by(UID) %>%
slice(seq(match("Die", EVENT, nomatch = n())))
# UID TIME.HOUR EVENT
# <int> <int> <chr>
#1 1 5 Run
#2 1 5 Run
#3 1 6 Run
#4 1 7 Die
#5 2 14 Jump
#6 2 15 Die
In case, we need a data.frame
output, chain with %>% as.data.frame
(from @R.S. comments)
Upvotes: 4
Reputation: 66819
This probably isn't so efficient, but you could do a fancy join:
mdf = df[EVENT == "Die", head(.SD, 1L), by=UID, .SDcols = "TIME.HOUR"]
df[!mdf, on=.(UID, TIME.HOUR > TIME.HOUR)]
UID TIME.HOUR EVENT
1: 1 5 Run
2: 1 5 Run
3: 1 6 Run
4: 1 7 Die
5: 2 14 Jump
6: 2 15 Die
You don't actually need to save the mdf
table as a separate object, of course.
How it works
x[!i]
, where i
is another data.table or list of vectors, is an anti-join, telling R to exclude rows of x
based on i
, similar to how it works with vectors (where i
would have to be a logical vector).
The on=.(ID, v >= v)
option tells R that we're doing a "non-equi join." The v >= v
part means that the v
from i
(on the left-hand side) should be greater than the v
from x
(on the right-hand side).
Combining these two, we're excluding rows that meet the criteria specified in the on=
.
Side notes. There are a couple things I'm not sure about: Do we have a better name than non-equi join? Why is the v
on the left from i
even though x[i]
has x
to the left of i
?
I borrowed from both Psidom and akrun's answer, using head
and an inequality. One (maybe?) advantage here is that head(.SD, 1L)
is optimized while head(.SD, expr)
is not yet.
Upvotes: 3
Reputation: 215047
Another option, you can use head()
with match()
(find the first Die
index):
dt[, head(.SD, match("Die", EVENT, nomatch = .N)), UID] # if no match is found within the
# group, return the whole group
# UID TIME.HOUR EVENT
#1: 1 5 Run
#2: 1 5 Run
#3: 1 6 Run
#4: 1 7 Die
#5: 2 14 Jump
#6: 2 15 Die
Upvotes: 2