Reputation: 4298
I have a data set in which each product is either an E-type SKU or a non-E-type SKU. My objective is to find whether, in any order, I sold an E-type SKU along with a similar non-E-type SKU.
For instance, if I sold E123 with W123, the order is considered a duplicate. If I sold E123 with another E123, it is not a duplicate. If I sold W123 with W123, it is not a duplicate either. To summarize, I need to find duplicates that involve at least one E-type SKU and at least one non-E-type SKU.
The question I previously asked on SO is similar (Find row-wise duplicates by groups), but the challenge with that approach is that once sub("^E", "", Product) is applied, I no longer know whether I am comparing an E-type SKU with another E-type SKU or with a non-E-type SKU.
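For example (using prefixes from the sample data below), after stripping the prefixes both product types collapse to the same string, so the comparison can no longer tell them apart:
sub("^E|^W", "", c("E12960", "W12960"))
#[1] "12960" "12960"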
Here's sample data:
dput(Test_File)
structure(list(Order = c(1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5,
5, 5), Product = c("E12960", "E12960", "E12960", "W12960", "W12960",
"W12960", "E1234", "E2345", "W2355", "A1235", "C-A-1234", "W-1234",
"A-1234", "C-1234")), .Names = c("Order", "Product"), row.names = c(NA,
14L), class = "data.frame")
Here's expected output:
dput(Output_File)
structure(list(Order = c(1, 2, 3, 4, 5), Duplicate = c("N", "Y",
"N", "N", "N")), .Names = c("Order", "Duplicate"), row.names = c(NA,
5L), class = "data.frame")
Here's my working code:
Test_File[,"ESKU_Present"]<-grepl("^E",Test_File$Product,ignore.case = TRUE)
#Strip initial identifiers
toMatch<-c("^E","^W","^A","^C-","^W-","C-A-","^A-")
Test_File[grepl(paste(toMatch,collapse="|"),Test_File$Product,ignore.case = TRUE),"New_Product_ID"]<-sub(paste(toMatch,collapse="|"), "", Test_File$Product)
Output <- Test_File %>%
dplyr::group_by(Order) %>%
#find those orders that have at least one ESKU and one non-ESKU
mutate(Duplicate = (any(ESKU_Present ==c("FALSE")) & any(ESKU_Present == c("TRUE")))) %>%
dplyr::filter(Duplicate == "TRUE") %>%
dplyr::summarise(Final_Flag = any(duplicated(New_Product_ID))) %>%
right_join(Test_File) %>%
dplyr::select(Order, Final_Flag)
Output[is.na(Output$Final_Flag),"Final_Flag"]<-FALSE
Output<-dplyr::distinct(Output)
I have two questions:
a) How do I use data.table to do what I am trying to do?
b) Is there any faster way to do this? I am asking this because on my original data set, which has about 1M rows, the above code is taking forever.
Upvotes: 0
Views: 42
Reputation: 886938
We can do this in data.table. Convert the 'data.frame' to a 'data.table' (setDT(Test_File)), extract the non-numeric ('v1') and numeric ('v2') substrings by matching the numeric and non-numeric characters respectively and replacing them with blanks, then, grouped by 'Order', check whether any non-numeric part is an 'E', the number of unique non-numeric parts is greater than 1, and there is any duplicate among the numeric parts.
library(data.table)
setDT(Test_File)[, {
    v1 <- sub("\\d+", "", Product)   # letter prefix, e.g. "E", "W", "C-A-"
    v2 <- sub("\\D+", "", Product)   # numeric part, e.g. "12960"
    .(Duplicate = any(v1 == "E") & uniqueN(v1) > 1 & anyDuplicated(v2) > 0)
  }, Order]
# Order Duplicate
#1: 1 FALSE
#2: 2 TRUE
#3: 3 FALSE
#4: 4 FALSE
#5: 5 FALSE
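Since the original concern is speed on roughly 1M rows, one rough way to time this approach on a larger synthetic data set is sketched below (the replication scheme, the 'big' object and the use of the microbenchmark package are assumptions for illustration, not part of the original post):
library(microbenchmark)
# stack ~100k copies of the sample data (~1.4M rows) and relabel 'Order'
# so groups stay small; synthetic, for timing only
big <- rbindlist(replicate(1e5, Test_File, simplify = FALSE))
big[, Order := rep(seq_len(.N / 14L), each = 14L)]

microbenchmark(
  data.table = big[, {
      v1 <- sub("\\d+", "", Product)
      v2 <- sub("\\D+", "", Product)
      .(Duplicate = any(v1 == "E") & uniqueN(v1) > 1 & anyDuplicated(v2) > 0)
    }, Order],
  times = 5
)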
NOTE: It may be better to keep 'Duplicate' as a logical column rather than converting it to 'Y'/'N'.
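If the 'Y'/'N' format of Output_File is still wanted, the logical flag can be converted afterwards; a small sketch, assuming the result above has been assigned to an object named res (the name is hypothetical):
res[, Duplicate := ifelse(Duplicate, "Y", "N")]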
Upvotes: 1