Find row-wise duplicates by groups

Question

I have a data-set that may have duplicate products. If there are duplicate products, we would mark that order as "Y", and vice versa.

Here are my data:

dput(Test_File)
structure(list(Order = c(1234, 1234, 2345, 2345, 2345, 3456, 
3456, 4567, 5678, 5678, 5678, 5678, 9999, 9999), Product = c("A12960", 
"12960", "B3560", "3560", "A3850", "3850", "A3850", "A2920", 
"2930", "2921", "A2921", "A2930", "A1234", "2345"), ASKU = c("Y", 
"N", "N", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "Y", "Y", "N"
)), .Names = c("Order", "Product", "ASKU"), row.names = c(NA, 
14L), class = "data.frame")

Commentary about the data: I have a column that says ASKU. This is mainly to identify whether a particular SKU starts with A. Also, please note that A1234 and 1234 would be considered duplicate. A1234 and 23456 wouldn't be considered duplicate. Similarly, B1234 and 1234 won't be considered duplicates. So, to identify duplicates, one would ignore A in ASKU column.

Expected Output:

dput(Output_File)
structure(list(Order = c(1234, 2345, 3456, 4567, 5678, 9999), 
    Duplicate = c("Y", "N", "Y", "N", "Y", "N")), .Names = c("Order", 
"Duplicate"), row.names = c(NA, 6L), class = "data.frame")

My code (non-working):

I tried this code, but I am getting an error. The idea is to do row-wise comparison of SKUs after extracting string from SKU name.

Error: Duplicate identifiers for rows (3, 4), (8, 9), (10, 11)

Test_File$New_SKU<-NA_character_

Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"New_SKU"]<-sub("^A","",Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"Product"])

Test_File[Test_File$ASKU=="N","New_SKU"]<-Test_File[Test_File$ASKU=="N","Product"]

Test_File %>%
  dplyr::group_by(Order) %>%
  dplyr::mutate(DCount = n_distinct(ASKU)) %>%
  dplyr::filter(DCount>=2) %>%
  dplyr::ungroup() %>%
  dplyr::select(Order,New_SKU,ASKU) %>%
  dplyr::distinct() %>%
  tidyr::spread(key = ASKU,value = New_SKU)

Can someone please help me? I would appreciate if you could help me with dplyr-based solution and data.table-based solution.

akrun · Accepted Answer

Here is an option using data.table

library(data.table)
setDT(Test_File)[, .(Duplicate = c("N", "Y")[(anyDuplicated(sub("^A",
                  "", Product)) > 0)+1]), Order]
#    Order Duplicate
#1:  1234         Y
#2:  2345         N
#3:  3456         Y
#4:  4567         N
#5:  5678         Y
#6:  9999         N

Or with base R

i1 <- with(Test_File, tapply(sub("^A", "", Product), Order, FUN = anyDuplicated)>0)
stack(split(i1, names(i1)))[2:1]

Find row-wise duplicates by groups

Answers (2)

Related Questions