Reputation: 13
I need to delete rows that are duplicate. Both the rows and with a condition. Please find below the sample data frame.
Sr. Invoice Status Name
1 XXX Booked ABC
2 YYY Booked DEF
3 YYY Cancelled DEF
4 ZZZ Booked GHI
5 ZZZ Changed GHI
I need to delete both instances of invoices that been repeated and status is cancelled.
This is what data frame should look like:
Sr. Invoice Status Name
1 XXX Booked ABC
2 ZZZ Booked GHI
3 ZZZ Changed GHI
And I need to separate data frame for the deleted set. That is:
Sr. Invoice Status Name
1 YYY Booked DEF
2 YYY Cancelled DEF
Upvotes: 1
Views: 86
Reputation: 1702
I think there's an easier way using tidyverse
. Create groups and then filter
at the group level using base all
and any
.
library(tidyverse) # Load library
To remove groups with a cancelled status:
df %>%
group_by(Invoice) %>%
filter(all(Status != "Cancelled"))
# A tibble: 3 x 4
# Groups: Invoice [2]
Sr. Invoice Status Name
<dbl> <chr> <chr> <chr>
1 1 XXX Booked ABC
2 4 ZZZ Booked GHI
3 5 ZZZ Changed GHI
To separate groups with cancelled status:
df %>%
group_by(Invoice) %>%
filter(any(Status == "Cancelled"))
# A tibble: 2 x 4
# Groups: Invoice [1]
Sr. Invoice Status Name
<dbl> <chr> <chr> <chr>
1 2 YYY Booked DEF
2 3 YYY Cancelled DEF
Upvotes: 0
Reputation: 47310
Using ave
we build a boolean vector, with split
we use it to create 2 separate data frames:
split(df1,ave(df1$Status, df1$Invoice, FUN = function(x) tail(x,1) != "Cancelled"))
# $`FALSE`
# Sr. Invoice Status Name
# 2 2 YYY Booked DEF
# 3 3 YYY Cancelled DEF
#
# $`TRUE`
# Sr. Invoice Status Name
# 1 1 XXX Booked ABC
# 4 4 ZZZ Booked GHI
# 5 5 ZZZ Changed GHI
This was answered considering the comment on the question, which makes me think only the occurence of Cancelled
as a last element is relevant.
data
df1 <- read.table(header=TRUE,stringsAsFactors=FALSE,text="Sr. Invoice Status Name
1 XXX Booked ABC
2 YYY Booked DEF
3 YYY Cancelled DEF
4 ZZZ Booked GHI
5 ZZZ Changed GHI")
Upvotes: 1
Reputation: 14764
You could also do:
library(dplyr)
df %>%
group_by(Invoice) %>%
mutate(Cancellation = +(any(Status == 'Cancelled'))) %>%
split(., .$Cancellation) %>%
setNames(., c("NoCancellations", "Cancellations")) %>%
list2env(., .GlobalEnv)
This would throw 2 new data frames in your environment called NoCancellations
and Cancellations
(you can rename this as you wish).
NoCancellations
# A tibble: 3 x 5
# Groups: Invoice [2]
Sr Invoice Status Name Cancellation
<int> <chr> <chr> <chr> <int>
1 1 XXX Book AB 0
2 4 ZZZ Book GH 0
3 5 ZZZ Changed GH 0
Cancellations
# A tibble: 2 x 5
# Groups: Invoice [1]
Sr Invoice Status Name Cancellation
<int> <chr> <chr> <chr> <int>
1 2 YYY Book DE 1
2 3 YYY Cancelled DE 1
New data frames would also contain a column named Cancellation
which was used for splitting; you can remove that if needed, e.g.:
df %>%
group_by(Invoice) %>%
mutate(Cancellation = +(any(Status == 'Cancelled'))) %>%
split(., .$Cancellation) %>%
lapply(., function(x) { x["Cancellation"] <- NULL; x }) %>%
setNames(., c("NoCancellations", "Cancellations")) %>%
list2env(., .GlobalEnv)
Instead of lapply
, you could also use purrr::map(., ~ (.x %>% select(-Cancellation)))
in that line.
Upvotes: 0
Reputation: 94182
Given a data set like this:
> d
Sr Invoice Status Name
1 1 XXX Book AB
2 2 YYY Book DE
3 3 YYY Cancelled DE
4 4 ZZZ Book GH
5 5 ZZZ Changed GH
This expression finds all the Invoice codes for anything that has a Cancelled status:
> d$Invoice[d$Status=="Cancelled"]
[1] "YYY"
and then this expression returns true or false for all rows with or without those codes:
> d$Invoice %in% d$Invoice[d$Status=="Cancelled"]
[1] FALSE TRUE TRUE FALSE FALSE
You can then use that logical vector to split your data frame. For example using split
to get a list of two elements:
> split(d, d$Invoice %in% d$Invoice[d$Status=="Cancelled"])
$`FALSE`
Sr Invoice Status Name
1 1 XXX Book AB
4 4 ZZZ Book GH
5 5 ZZZ Changed GH
$`TRUE`
Sr Invoice Status Name
2 2 YYY Book DE
3 3 YYY Cancelled DE
Upvotes: 1