Reputation: 4298
I have a data-set that may have duplicate products. If there are duplicate products, we would mark that order as "Y", and vice versa.
Here are my data:
dput(Test_File)
structure(list(Order = c(1234, 1234, 2345, 2345, 2345, 3456,
3456, 4567, 5678, 5678, 5678, 5678, 9999, 9999), Product = c("A12960",
"12960", "B3560", "3560", "A3850", "3850", "A3850", "A2920",
"2930", "2921", "A2921", "A2930", "A1234", "2345"), ASKU = c("Y",
"N", "N", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "Y", "Y", "N"
)), .Names = c("Order", "Product", "ASKU"), row.names = c(NA,
14L), class = "data.frame")
Commentary about the data: I have a column that says ASKU
. This is mainly to identify whether a particular SKU starts with A
. Also, please note that A1234 and 1234 would be considered duplicate. A1234 and 23456 wouldn't be considered duplicate. Similarly, B1234 and 1234 won't be considered duplicates. So, to identify duplicates, one would ignore A
in ASKU
column.
Expected Output:
dput(Output_File)
structure(list(Order = c(1234, 2345, 3456, 4567, 5678, 9999),
Duplicate = c("Y", "N", "Y", "N", "Y", "N")), .Names = c("Order",
"Duplicate"), row.names = c(NA, 6L), class = "data.frame")
My code (non-working):
I tried this code, but I am getting an error. The idea is to do row-wise comparison of SKUs after extracting string from SKU name.
Error: Duplicate identifiers for rows (3, 4), (8, 9), (10, 11)
Test_File$New_SKU<-NA_character_
Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"New_SKU"]<-sub("^A","",Test_File[grepl("^A",Test_File$Product,ignore.case = TRUE),"Product"])
Test_File[Test_File$ASKU=="N","New_SKU"]<-Test_File[Test_File$ASKU=="N","Product"]
Test_File %>%
dplyr::group_by(Order) %>%
dplyr::mutate(DCount = n_distinct(ASKU)) %>%
dplyr::filter(DCount>=2) %>%
dplyr::ungroup() %>%
dplyr::select(Order,New_SKU,ASKU) %>%
dplyr::distinct() %>%
tidyr::spread(key = ASKU,value = New_SKU)
Can someone please help me? I would appreciate if you could help me with dplyr
-based solution and data.table
-based solution.
Upvotes: 1
Views: 718
Reputation: 886938
Here is an option using data.table
library(data.table)
setDT(Test_File)[, .(Duplicate = c("N", "Y")[(anyDuplicated(sub("^A",
"", Product)) > 0)+1]), Order]
# Order Duplicate
#1: 1234 Y
#2: 2345 N
#3: 3456 Y
#4: 4567 N
#5: 5678 Y
#6: 9999 N
Or with base R
i1 <- with(Test_File, tapply(sub("^A", "", Product), Order, FUN = anyDuplicated)>0)
stack(split(i1, names(i1)))[2:1]
Upvotes: 1
Reputation: 388797
We can use dplyr
and remove the A
from the Product
which starts with it and then grouping by Order
we can check if there are any duplicate values in that group.
library(dplyr)
Test_File %>%
mutate(Product = sub("^A", "", Product)) %>%
group_by(Order) %>%
summarise(Duplicate = any(duplicated(Product)))
# Order Duplicate
# <dbl> <lgl>
#1 1234 TRUE
#2 2345 FALSE
#3 3456 TRUE
#4 4567 FALSE
#5 5678 TRUE
#6 9999 FALSE
If we need the output to be in Y/N format, the values of Duplicate
columns can be easily replaced using ifelse
Test_File %>%
mutate(Product = sub("^A", "", Product)) %>%
group_by(Order) %>%
summarise(Duplicate = ifelse(any(duplicated(Product)), "Y", "N"))
Upvotes: 1