Reputation: 183
I am new to R and trying to delete rows based on values of previous rows. Sample data:
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | 0
500219 | 2016-04-14 11:00:00 | 1
500219 | 2016-04-15 12:00:00 | 1
500219 | 2016-05-23 09:00:00 | 0
500219 | 2016-05-02 19:00:00 | 0
500220 | 2016-04-11 12:00:00 | 0
500220 | 2016-04-14 11:00:00 | 1
500220 | 2016-04-15 12:00:00 | 1
500220 | 2016-05-23 09:00:00 | 0
500220 | 2016-05-02 19:00:00 | 0
I would like to keep only the rows up to and including the last row where Value = 1 for each Cust_ID, giving the result:
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | 0
500219 | 2016-04-14 11:00:00 | 1
500219 | 2016-04-15 12:00:00 | 1
500220 | 2016-04-11 12:00:00 | 0
500220 | 2016-04-14 11:00:00 | 1
500220 | 2016-04-15 12:00:00 | 1
Any help would be appreciated!
Upvotes: 2
Views: 1670
Reputation: 887153
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). Grouped by 'Cust_ID', we take the sequence from 1 to the max of the indexes where 'Value' is 1, use it to pick the row index (.I), and use that to subset the data.table rows. If a group has no 1 at all, all of its rows are kept.
library(data.table)
setDT(df1)[df1[, if(any(Value == 1)) .I[seq(max(which(Value == 1)))]
else .I[1:.N] , by = Cust_ID]$V1]
# Cust_ID Date Value
#1: 500219 2016-04-11 12:00:00 0
#2: 500219 2016-04-12 16:00:00 0
#3: 500219 2016-04-14 11:00:00 1
#4: 500219 2016-04-15 12:00:00 1
#5: 500220 2016-04-11 12:00:00 0
#6: 500220 2016-04-14 11:00:00 1
#7: 500220 2016-04-15 12:00:00 1
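The index logic inside the grouped call can be checked on a single customer's values in plain R. This is only a sketch of the idea, not the data.table call itself; the vector below is customer 500219's Value column, and the names ones, last_one, keep_idx, no_ones and keep_all are made up for illustration.

```r
# Values for one customer, in row order (taken from the sample data)
Value <- c(0, 0, 1, 1, 0, 0)

ones     <- which(Value == 1)  # positions of the 1s within the group
last_one <- max(ones)          # position of the last 1
keep_idx <- seq(last_one)      # positions 1..last_one are the rows to keep

# Fallback branch: a group with no 1 at all keeps every row
no_ones  <- c(0, 0, 0)
keep_all <- if (any(no_ones == 1)) seq(max(which(no_ones == 1))) else seq_along(no_ones)
```

In the grouped call, these within-group positions are used to index .I, which turns them into row numbers of the whole table.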
Or, using a similar approach with dplyr:
library(dplyr)
df1 %>%
  group_by(Cust_ID) %>%
  slice(if (any(Value == 1)) seq(max(which(Value == 1))) else row_number())
# Cust_ID Date Value
# <int> <chr> <int>
#1 500219 2016-04-11 12:00:00 0
#2 500219 2016-04-12 16:00:00 0
#3 500219 2016-04-14 11:00:00 1
#4 500219 2016-04-15 12:00:00 1
#5 500220 2016-04-11 12:00:00 0
#6 500220 2016-04-14 11:00:00 1
#7 500220 2016-04-15 12:00:00 1
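For anyone without either package, the same "keep everything up to the last 1" idea can probably be written in base R as well. The following is a minimal sketch, assuming the sample data from the question with Date stored as character and the rows of each customer in the order shown; idx and res are illustrative names.

```r
# Sample data rebuilt from the question
df1 <- data.frame(
  Cust_ID = c(rep(500219L, 6), rep(500220L, 5)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
  Value = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Split the row numbers by customer, then keep each customer's rows up to
# and including the last Value == 1 (or all rows if there is no 1)
idx <- unlist(lapply(split(seq_len(nrow(df1)), df1$Cust_ID), function(rows) {
  v <- df1$Value[rows]
  if (any(v == 1)) rows[seq(max(which(v == 1)))] else rows
}), use.names = FALSE)

res <- df1[idx, ]
```

Here split() on the row numbers plays the role that .I plays in the data.table version.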
Upvotes: 2
Reputation: 787
Looping approach:
cust <- 0
keep <- FALSE
keepers <- vector(mode = "logical", length = nrow(df))
## walk through the data frame backwards
for (rec in nrow(df):1) {
  ## have we been working with this customer?
  if (df[rec, ]$Cust_ID == cust) {
    if (df[rec, ]$Value == 1 || keep) {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
  } else {
    ## new customer: start keeping only if this row has Value == 1
    cust <- df[rec, ]$Cust_ID
    if (df[rec, ]$Value == 1) {
      keepers[rec] <- TRUE
      keep <- TRUE
    } else {
      keep <- FALSE
    }
  }
}
df <- df[keepers, ]
df
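One behaviour that may be worth checking before relying on this: as far as I can tell, a customer with no Value == 1 row ends up with no rows at all, because keepers is never set to TRUE for them. A small sketch of the same backwards walk, wrapped in a hypothetical helper keep_through_last_one and run on made-up data (it assumes each customer's rows are contiguous and that no Cust_ID equals 0):

```r
# Hypothetical wrapper around the backwards-walk logic
keep_through_last_one <- function(df) {
  cust <- 0
  keep <- FALSE
  keepers <- logical(nrow(df))
  for (rec in rev(seq_len(nrow(df)))) {
    # reset the flag whenever a new customer starts
    if (df$Cust_ID[rec] != cust) {
      cust <- df$Cust_ID[rec]
      keep <- FALSE
    }
    # once a 1 has been seen (walking backwards), keep every earlier row
    if (df$Value[rec] == 1 || keep) {
      keepers[rec] <- TRUE
      keep <- TRUE
    }
  }
  df[keepers, ]
}

# Made-up data: customer 1 has a 1, customer 2 never does
toy <- data.frame(Cust_ID = c(1, 1, 1, 2, 2), Value = c(0, 1, 0, 0, 0))
out <- keep_through_last_one(toy)
```

With this toy input, out holds only the first two rows of customer 1, and customer 2 is dropped entirely.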
Upvotes: 0
Reputation: 38510
Here is a split-apply-combine method that keeps any values that are 1 as well as the values before the first 1 for each customer.
# split data by customer ID
myList <- split(df, df$Cust_ID)
# loop through ID list, drop desired rows, rbind resulting list
dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  i[c(1:drop[1], drop[-1]), ]
}))
which returns
dfNew
Cust_ID Date Value
500219.1 500219 2016-04-11 12:00:00 0
500219.2 500219 2016-04-12 16:00:00 0
500219.3 500219 2016-04-14 11:00:00 1
500219.4 500219 2016-04-15 12:00:00 1
500220.7 500220 2016-04-11 12:00:00 0
500220.8 500220 2016-04-14 11:00:00 1
500220.9 500220 2016-04-15 12:00:00 1
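On the sample data this matches the desired output, but the rule is slightly different from "everything up to the last 1": a 0 that sits between two 1s would be dropped here. A quick sketch on made-up values (not from the question) that compares the two index sets:

```r
# Made-up values with a 0 between two 1s
Value <- c(0, 1, 0, 1, 0)
ones  <- which(Value == 1)

# Rows up to the first 1, plus any later 1s (the rule used in this answer)
split_rows <- c(1:ones[1], ones[-1])

# Rows up to and including the last 1 (the rule used in the other answers)
last_one_rows <- seq(max(ones))
```

Here split_rows skips position 3 while last_one_rows includes it, so which rule is right depends on whether such in-between rows should survive.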
Note that this solution throws an error if any customer ID never has a value equal to 1.
If you want to retain customers that never reach the 1 threshold, then use
dfNew <- do.call(rbind, lapply(myList, function(i) {
  drop <- which(i$Value == 1)
  if (length(drop) != 0) i[c(1:drop[1], drop[-1]), ]
  else i
}))
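The reason the length check is needed can be reproduced on a single vector. This is just an illustration with made-up values; first and outcome are names invented for the sketch.

```r
# A customer whose values never reach 1
Value <- c(0, 0, 0)

drop  <- which(Value == 1)  # empty integer vector
first <- drop[1]            # indexing past the end gives NA

# 1:NA is not a valid sequence, so the unguarded version stops here
outcome <- tryCatch({
  1:first
  "ok"
}, error = function(e) "error")
```

Since outcome comes back as "error", the guarded version returns the customer's rows untouched instead of building the index.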
Upvotes: 2