watchtower
watchtower

Reputation: 4298

Find row-wise duplicates by groups

I have a data-set that may have duplicate products. If there are duplicate products, we would mark that order as "Y", and vice versa.

Here are my data:

dput(Test_File)
structure(list(Order = c(1234, 1234, 2345, 2345, 2345, 3456, 
3456, 4567, 5678, 5678, 5678, 5678, 9999, 9999), Product = c("A12960", 
"12960", "B3560", "3560", "A3850", "3850", "A3850", "A2920", 
"2930", "2921", "A2921", "A2930", "A1234", "2345"), ASKU = c("Y", 
"N", "N", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "Y", "Y", "N"
)), .Names = c("Order", "Product", "ASKU"), row.names = c(NA, 
14L), class = "data.frame")

Commentary about the data: I have a column that says ASKU. This is mainly to identify whether a particular SKU starts with A. Also, please note that A1234 and 1234 would be considered duplicate. A1234 and 23456 wouldn't be considered duplicate. Similarly, B1234 and 1234 won't be considered duplicates. So, to identify duplicates, one would ignore A in ASKU column.

Expected Output:

dput(Output_File)
structure(list(Order = c(1234, 2345, 3456, 4567, 5678, 9999), 
    Duplicate = c("Y", "N", "Y", "N", "Y", "N")), .Names = c("Order", 
"Duplicate"), row.names = c(NA, 6L), class = "data.frame")

My code (non-working):

I tried this code, but I am getting an error. The idea is to do row-wise comparison of SKUs after extracting string from SKU name.

Error: Duplicate identifiers for rows (3, 4), (8, 9), (10, 11)

Test_File$New_SKU<-NA_character_

Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"New_SKU"]<-sub("^A","",Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"Product"])

Test_File[Test_File$ASKU=="N","New_SKU"]<-Test_File[Test_File$ASKU=="N","Product"]

Test_File %>%
  dplyr::group_by(Order) %>%
  dplyr::mutate(DCount = n_distinct(ASKU)) %>%
  dplyr::filter(DCount>=2) %>%
  dplyr::ungroup() %>%
  dplyr::select(Order,New_SKU,ASKU) %>%
  dplyr::distinct() %>%
  tidyr::spread(key = ASKU,value = New_SKU)

Can someone please help me? I would appreciate if you could help me with dplyr-based solution and data.table-based solution.

Upvotes: 1

Views: 718

Answers (2)

akrun
akrun

Reputation: 886938

Here is an option using data.table

library(data.table)
setDT(Test_File)[, .(Duplicate = c("N", "Y")[(anyDuplicated(sub("^A",
                  "", Product)) > 0)+1]), Order]
#    Order Duplicate
#1:  1234         Y
#2:  2345         N
#3:  3456         Y
#4:  4567         N
#5:  5678         Y
#6:  9999         N

Or with base R

i1 <- with(Test_File, tapply(sub("^A", "", Product), Order, FUN = anyDuplicated)>0)
stack(split(i1, names(i1)))[2:1]

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388797

We can use dplyr and remove the A from the Product which starts with it and then grouping by Order we can check if there are any duplicate values in that group.

library(dplyr)
Test_File %>%
  mutate(Product = sub("^A", "", Product)) %>%
  group_by(Order) %>%
  summarise(Duplicate = any(duplicated(Product)))


#  Order Duplicate
#  <dbl>     <lgl>
#1  1234      TRUE
#2  2345     FALSE
#3  3456      TRUE
#4  4567     FALSE
#5  5678      TRUE
#6  9999     FALSE

If we need the output to be in Y/N format, the values of Duplicate columns can be easily replaced using ifelse

Test_File %>%
   mutate(Product = sub("^A", "", Product)) %>%
   group_by(Order) %>%
   summarise(Duplicate = ifelse(any(duplicated(Product)), "Y", "N"))

Upvotes: 1

Related Questions