Reputation: 65
I have a dataset with the following variables - Bill_Number,Item_Name,Quantity. A single Bill_Number can have multiple Item_Names under it. I am trying to summarize it by Bill_Number and Quantity wherein the Item_Names are concatenated into a single variable for each bill number. The solutions here :Combine several row variables work well for a limited and pre-defined frame (like given below), but I have hundreds of Item_Names. I am sure there is a simpler way to do this. Can anyone help please?
Data (sample only):
BillN<-c('B1','B1','B1','B1','B2','B2','B2','B2','B3','B3','B3','B3')
Item_Name<-c('Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D') # going on to Product(n)
Quantity<-c(1,2,1,2,1,2,1,1,1,2,1,1)
Output: using ply::ddply
ddply(within(Dummy1, {
Item_Name <- ifelse(Item_Name %in% c('Prod A','Prod B','Prod C'), 'Prod A + Prod B + Prod C', 'Prod D')
}), .(BillN, Item_Name), summarise, Count=sum(Quantity))
This depends on 'Prod D' being a separate row, whereas I need a combination of all the products for a given number and the overall quantity for that bill number.
The output I require is in this format:
Whereas the output I get from the code above is:
Extending the scenario, what happens if I have more variables such as the Brand, Category, Sub_category for each Item_Name and the Sales_Amount as a numeric variable to further sum it by?
Please let me know if you need more information or have questions.
Thanks, Raoul
Upvotes: 1
Views: 680
Reputation: 3280
Here is a quick solution using data.table
package:
Step1: Create the data.table
library(data.table)
DT <- data.table(
BillN=c('B1','B1','B1','B1','B2','B2','B2','B2','B3','B3','B3','B3'),
Item_Name=c('Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D'), # going on to Product(n)
Quantity=c(1,2,1,2,1,2,1,1,1,2,1,1)
)
Step2: Set appropriate key:
setkey(DT,BillN)
Step3: Make sure that the string vector Item_Name
is not a factor
.
DT[,Item_Name := as.character(Item_Name)]
Step4: Perform the operation by key
DT[,list(Item_Name =paste(Item_Name,collapse=" + "),
Quantity=sum(Quantity)),
by=key(DT)]
You can use the result as it is, or collapse each row! Which is simple!!
Upvotes: 1
Reputation: 78832
Here's an example summing both the Quantity
variables by BillN
and the count of BillN
s as your expected output wants:
Dummy1 <- data.frame(BillN, Item_Name, Quantity)
ldply(by(Dummy1, Dummy1$BillN, simplify=TRUE, FUN=function(x) {
BillN <- x[1,]$BillN
Quantity <- sum(x$Quantity)
Item_Name <- paste(as.character(unique(x$Item_Name)), sep="", collapse=" + ")
BillCt <- nrow(x)
return(data.frame(BillN, Item_Name, Quantity, BillCt))
}))[2:5]
## BillN Item_Name Quantity BillCt
## 1 B1 Prod A + Prod B + Prod C + Prod D 6 4
## 2 B2 Prod A + Prod B + Prod C + Prod D 5 4
## 3 B3 Prod A + Prod B + Prod C + Prod D 5 4
Upvotes: 0