Reputation: 551
For the given data frame:
id |conv |diff | counter
1 | 0 | 0 | 1
1 | 0 | 3 | 1
1 | 0 | 45 | 2
1 | 1 | 9 | 2
1 | 0 | 40 | 1
1 | 1 | 34 | 2
1 | 0 | 43 | 1
1 | 0 | 7 | 1
2 | 0 | 0 | 1
2 | 1 | 5 | 1
2 | 0 | 0 | 1
2 | 1 | 45 | 2
2 | 1 | 40 | 1
I need to delete the rows which are not ending up as conv =1 when the id changes. For my example the rows with diff equals to 43 and 7 shall be removed.
Upvotes: 0
Views: 56
Reputation: 887118
We can use data.table
. Convert the 'data.frame' to 'data.table' (setDT(df1)
), grouped by 'id', we find the index of the last element where 'conv' is 1 ('i1'), if
it is less than the number of rows (.N
), get the sequence from the next element to .N
, get the row index (.I
) and use that to remove the rows
library(data.table)
setDT(df1)[-df1[, {i1 <- tail(which(conv==1), 1); if(i1 < .N) .I[(i1+1):.N]}, id]$V1]
# id conv diff counter
# 1: 1 0 0 1
# 2: 1 0 3 1
# 3: 1 0 45 2
# 4: 1 1 9 2
# 5: 1 0 40 1
# 6: 1 1 34 2
# 7: 2 0 0 1
# 8: 2 1 5 1
# 9: 2 0 0 1
#10: 2 1 45 2
#11: 2 1 40 1
Or another option is after grouping by 'id', we get the cumulative sum of 'conv' and then multiply by 'conv', So, the 0 values remain as 0 by multiplication, then we can get the index of maximum value by which.max get the sequence and subset the dataset
setDT(df1)[, .SD[seq_len(which.max(cumsum(conv)*conv))], by = id]
To illustrate how this happens (extracting the column with $V1
)
setDT(df1)[, cumsum(conv), by = id]$V1
#[1] 0 0 0 1 1 2 2 2 0 1 1 2 3
Now, we find that the repeating 2s at the end for the 'id' 1. The last two '2' values for 'id' 1 are the ones we need to remove, so we multiply by 'conv'
setDT(df1)[, cumsum(conv) * conv, by = id]$V1
#[1] 0 0 0 1 0 2 0 0 0 1 0 2 3
and get the index of max value
setDT(df1)[, which.max(cumsum(conv) * conv), by = id]$V1
#[1] 6 5
and sequence it
setDT(df1)[, seq_len(which.max(cumsum(conv) * conv)), by = id]$V1
#[1] 1 2 3 4 5 6 1 2 3 4 5
Use this to subset the data.table (.SD
)
Upvotes: 1