Sid
Sid

Reputation: 13

Deleting duplicates with condition

I need to delete rows that are duplicate. Both the rows and with a condition. Please find below the sample data frame.

Sr. Invoice Status    Name
1   XXX     Booked    ABC
2   YYY     Booked    DEF
3   YYY     Cancelled DEF
4   ZZZ     Booked    GHI
5   ZZZ     Changed   GHI

I need to delete both instances of invoices that been repeated and status is cancelled.

This is what data frame should look like:

Sr. Invoice Status    Name
1   XXX     Booked    ABC
2   ZZZ     Booked    GHI
3   ZZZ     Changed   GHI

And I need to separate data frame for the deleted set. That is:

Sr. Invoice Status    Name
1   YYY     Booked    DEF
2   YYY     Cancelled DEF

Upvotes: 1

Views: 86

Answers (4)

DeduciveR
DeduciveR

Reputation: 1702

I think there's an easier way using tidyverse. Create groups and then filter at the group level using base all and any.

library(tidyverse) # Load library

To remove groups with a cancelled status:

df %>%
  group_by(Invoice) %>% 
  filter(all(Status != "Cancelled"))

# A tibble: 3 x 4
# Groups:   Invoice [2]
    Sr. Invoice Status  Name 
  <dbl> <chr>   <chr>   <chr>
1     1 XXX     Booked  ABC  
2     4 ZZZ     Booked  GHI  
3     5 ZZZ     Changed GHI

To separate groups with cancelled status:

df %>%
  group_by(Invoice) %>% 
  filter(any(Status == "Cancelled"))

# A tibble: 2 x 4
# Groups:   Invoice [1]
    Sr. Invoice Status    Name 
  <dbl> <chr>   <chr>     <chr>
1     2 YYY     Booked    DEF  
2     3 YYY     Cancelled DEF  

Upvotes: 0

moodymudskipper
moodymudskipper

Reputation: 47310

Using ave we build a boolean vector, with split we use it to create 2 separate data frames:

split(df1,ave(df1$Status, df1$Invoice, FUN = function(x) tail(x,1) != "Cancelled"))

# $`FALSE`
#   Sr. Invoice    Status Name
# 2   2     YYY    Booked  DEF
# 3   3     YYY Cancelled  DEF
# 
# $`TRUE`
#   Sr. Invoice  Status Name
# 1   1     XXX  Booked  ABC
# 4   4     ZZZ  Booked  GHI
# 5   5     ZZZ Changed  GHI

This was answered considering the comment on the question, which makes me think only the occurence of Cancelled as a last element is relevant.

data

df1 <- read.table(header=TRUE,stringsAsFactors=FALSE,text="Sr. Invoice Status    Name
1   XXX     Booked    ABC
2   YYY     Booked    DEF
3   YYY     Cancelled DEF
4   ZZZ     Booked    GHI
5   ZZZ     Changed   GHI")

Upvotes: 1

arg0naut91
arg0naut91

Reputation: 14764

You could also do:

library(dplyr)

df %>%
  group_by(Invoice) %>%
  mutate(Cancellation = +(any(Status == 'Cancelled'))) %>%
  split(., .$Cancellation) %>%
  setNames(., c("NoCancellations", "Cancellations")) %>%
  list2env(., .GlobalEnv)

This would throw 2 new data frames in your environment called NoCancellations and Cancellations (you can rename this as you wish).

NoCancellations

# A tibble: 3 x 5
# Groups:   Invoice [2]
     Sr Invoice Status  Name  Cancellation
  <int> <chr>   <chr>   <chr>        <int>
1     1 XXX     Book    AB               0
2     4 ZZZ     Book    GH               0
3     5 ZZZ     Changed GH               0

Cancellations

# A tibble: 2 x 5
# Groups:   Invoice [1]
     Sr Invoice Status    Name  Cancellation
  <int> <chr>   <chr>     <chr>        <int>
1     2 YYY     Book      DE               1
2     3 YYY     Cancelled DE               1

New data frames would also contain a column named Cancellation which was used for splitting; you can remove that if needed, e.g.:

df %>%
  group_by(Invoice) %>%
  mutate(Cancellation = +(any(Status == 'Cancelled'))) %>%
  split(., .$Cancellation) %>%
  lapply(., function(x) { x["Cancellation"] <- NULL; x }) %>%
  setNames(., c("NoCancellations", "Cancellations")) %>%
  list2env(., .GlobalEnv)

Instead of lapply, you could also use purrr::map(., ~ (.x %>% select(-Cancellation))) in that line.

Upvotes: 0

Spacedman
Spacedman

Reputation: 94182

Given a data set like this:

> d
  Sr Invoice    Status Name
1  1     XXX      Book   AB
2  2     YYY      Book   DE
3  3     YYY Cancelled   DE
4  4     ZZZ      Book   GH
5  5     ZZZ   Changed   GH

This expression finds all the Invoice codes for anything that has a Cancelled status:

> d$Invoice[d$Status=="Cancelled"]
[1] "YYY"

and then this expression returns true or false for all rows with or without those codes:

> d$Invoice %in% d$Invoice[d$Status=="Cancelled"]
[1] FALSE  TRUE  TRUE FALSE FALSE

You can then use that logical vector to split your data frame. For example using split to get a list of two elements:

> split(d, d$Invoice %in% d$Invoice[d$Status=="Cancelled"])
$`FALSE`
  Sr Invoice  Status Name
1  1     XXX    Book   AB
4  4     ZZZ    Book   GH
5  5     ZZZ Changed   GH

$`TRUE`
  Sr Invoice    Status Name
2  2     YYY      Book   DE
3  3     YYY Cancelled   DE

Upvotes: 1

Related Questions