watchtower
watchtower

Reputation: 4298

Find row-wise duplicates by groups for a specific item

I have a data set that has an E-type SKU or non-E-type SKU. My objective is to find whether in any order I sold E-type SKU along with similar non-E-type SKU.

For instance, if I sold E123 with W123, this will be considered duplicate. If I sold E123 with another E123, this won't be considered duplicate. If I sold W123 with W123, it won't be considered duplicate either. To summarize, I need to find a duplicate that has at least one E-type SKU and at least one non-E-type SKU.

The previous example I asked on SO is similar (Find row-wise duplicates by groups) but the challenge with that methodology is that when sub("^E","", Product) is applied, I no longer know whether I am comparing an E-type SKU with E-type SKU or an E-type SKU with non-E-type SKU.

Here's sample data:

dput(Test_File)
structure(list(Order = c(1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 
5, 5), Product = c("E12960", "E12960", "E12960", "W12960", "W12960", 
"W12960", "E1234", "E2345", "W2355", "A1235", "C-A-1234", "W-1234", 
"A-1234", "C-1234")), .Names = c("Order", "Product"), row.names = c(NA, 
14L), class = "data.frame")

Here's expected output:

dput(Output_File)
structure(list(Order = c(1, 2, 3, 4, 5), Duplicate = c("N", "Y", 
"N", "N", "N")), .Names = c("Order", "Duplicate"), row.names = c(NA, 
5L), class = "data.frame")

Here's my working code:

Test_File[,"ESKU_Present"]<-grepl("^E",Test_File$Product,ignore.case = TRUE)

#Strip initial identifiers
toMatch<-c("^E","^W","^A","^C-","^W-","C-A-","^A-")
Test_File[grepl(paste(toMatch,collapse="|"),Test_File$Product,ignore.case = TRUE),"New_Product_ID"]<-sub(paste(toMatch,collapse="|"), "", Test_File$Product)

Output <- Test_File %>%
  dplyr::group_by(Order) %>%
  #find those orders that have at least one ESKU and one non-ESKU
  mutate(Duplicate = (any(ESKU_Present ==c("FALSE")) & any(ESKU_Present == c("TRUE")))) %>%
  dplyr::filter(Duplicate == "TRUE") %>%
  dplyr::summarise(Final_Flag = any(duplicated(New_Product_ID))) %>%
  right_join(Test_File) %>%
  dplyr::select(Order, Final_Flag)

Output[is.na(Output$Final_Flag),"Final_Flag"]<-FALSE
Output<-dplyr::distinct(Output)

I have two questions:

a) How do I use data.table to do what I am trying to do? b) Is there any faster way to do this? I am asking this because on my original data set, which has about 1M rows, the above code is taking forever.

Upvotes: 0

Views: 42

Answers (1)

akrun
akrun

Reputation: 886938

We can do this in data.table. Convert the 'data.frame' to 'data.table' (setDT(Test_File)), extract the non-numeric ('v1') and numeric substring by matching the numeric and non-numeric characters and replacing it with blanks, then grouped by 'Order' we check if there is any non-numeric character that is an 'E' as well as the length of unique elements for non-numeric characters is greater than 1 and there is any duplicates characters for numeric part.

library(data.table) 
setDT(Test_File)[, {
           v1 <- sub("\\d+", "", Product)
           v2 <- sub("\\D+", "", Product)
         .(Duplicate = any(v1=="E") & uniqueN(v1)>1 & anyDuplicated(v2)>0)} , Order]
#     Order Duplicate
#1:     1     FALSE
#2:     2      TRUE
#3:     3     FALSE
#4:     4     FALSE
#5:     5     FALSE

NOTE: It may be better to keep as logical column ('Duplicate') instead of 'Y/N'

Upvotes: 1

Related Questions