anarchy
anarchy

Reputation: 551

Deleting rows on conditions

For the given data frame:

id |conv |diff | counter  
 1 | 0 | 0  | 1  
 1 | 0 | 3  | 1  
 1 | 0 | 45 | 2   
 1 | 1 | 9  | 2   
 1 | 0 | 40 | 1  
 1 | 1 | 34 | 2  
 1 | 0 | 43 | 1  
 1 | 0 | 7  | 1  
 2 | 0 | 0  | 1  
 2 | 1 | 5  | 1  
 2 | 0 | 0  | 1  
 2 | 1 | 45 | 2  
 2 | 1 | 40 | 1  

I need to delete the rows which are not ending up as conv =1 when the id changes. For my example the rows with diff equals to 43 and 7 shall be removed.

Upvotes: 0

Views: 56

Answers (1)

akrun
akrun

Reputation: 887118

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id', we find the index of the last element where 'conv' is 1 ('i1'), if it is less than the number of rows (.N), get the sequence from the next element to .N, get the row index (.I) and use that to remove the rows

library(data.table)
setDT(df1)[-df1[,  {i1 <- tail(which(conv==1), 1); if(i1 < .N) .I[(i1+1):.N]}, id]$V1]
#     id conv diff counter
# 1:  1    0    0       1
# 2:  1    0    3       1
# 3:  1    0   45       2
# 4:  1    1    9       2
# 5:  1    0   40       1
# 6:  1    1   34       2
# 7:  2    0    0       1
# 8:  2    1    5       1
# 9:  2    0    0       1
#10:  2    1   45       2
#11:  2    1   40       1

Or another option is after grouping by 'id', we get the cumulative sum of 'conv' and then multiply by 'conv', So, the 0 values remain as 0 by multiplication, then we can get the index of maximum value by which.max get the sequence and subset the dataset

setDT(df1)[, .SD[seq_len(which.max(cumsum(conv)*conv))], by = id]

To illustrate how this happens (extracting the column with $V1)

setDT(df1)[,  cumsum(conv), by = id]$V1
#[1] 0 0 0 1 1 2 2 2 0 1 1 2 3

Now, we find that the repeating 2s at the end for the 'id' 1. The last two '2' values for 'id' 1 are the ones we need to remove, so we multiply by 'conv'

setDT(df1)[,  cumsum(conv) * conv, by = id]$V1
#[1] 0 0 0 1 0 2 0 0 0 1 0 2 3

and get the index of max value

setDT(df1)[,  which.max(cumsum(conv) * conv), by = id]$V1
#[1] 6 5

and sequence it

setDT(df1)[,  seq_len(which.max(cumsum(conv) * conv)), by = id]$V1
#[1] 1 2 3 4 5 6 1 2 3 4 5

Use this to subset the data.table (.SD)

Upvotes: 1

Related Questions