Reputation: 83
I am trying to create new variables in my data set that are cumulative totals which restart based on other variables (using group by)… I want these to be new columns in the data set and this is the part I am struggling with...
Using the data below, I want to create cumulative Sale and Profit columns that will restart for every Product and Product_Cat grouping.
The below code partly gives me what I need, but the variables are not new variables, instead it overwrites the existing Sale/Profit... what am I getting wrong? I imagine this is simple haven't found anything.
Note: I'm using lapply as my real data set has 40+ varbs that I need to create calculations for.
DT <- setDT(Data)[,lapply(.SD, cumsum), by = .(Product,Product_Cat) ]
Data for example:
Product <- c('A','A','A','B','B','B','C','C','C')
Product_Cat <- c('S1','S1','S2','C1','C1','C1','D1','E1','F1')
Sale <- c(10,15,5,20,15,10,5,5,5)
Profit <- c(2,4,2,6,8,2,4,6,8)
Sale_Cum <- c(10,25,5,20,35,45,5,5,5)
Profit_Cum <- c(2,6,2,6,14,16,4,6,8)
Data <- data.frame(Product,Product_Cat,Sale,Profit)
Desired_Data <- data.frame(Product,Product_Cat,Sale,Profit,Sale_Cum,Profit_Cum)
Upvotes: 2
Views: 858
Reputation: 83
Here is some pretty poor code that does everything step by step
#sample data
d<-sample(1:10)
f<-sample(1:10)
p<-c("f","f","f","f","q","q","q","w","w","w")
pc<-c("c","c","d","d","d","v","v","v","b","b")
cc<-data.table(p,pc,d,f)
#storing the values that are overwritten first.
three<-cc[,3]
four<- cc[,4]
#applying your function
dt<-setDT(c)[,lapply(.SD,cumsum), by=.(p,pc)]
#binding the stored values to your function and renaming everything.
x<-cbind(dt,three,four)
colnames(x)[5]<-"sale"
colnames(x)[6]<-"profit"
colnames(x)[4]<-"CumSale"
colnames(x)[3]<-"CumProfit"
#reordering the columns
xx<-x[,c("p","pc","profit","sale","CumSale","CumProfit")]
xx
Upvotes: 0
Reputation: 28705
library(data.table)
setDT(Data)
cols <- names(Data)[3:4]
Data[, paste0(cols, '_cumsum') := lapply(.SD, cumsum)
, by = .(Product, Product_Cat)
, .SDcols = cols]
Upvotes: 1
Reputation: 1015
Data:
structure(list(Product = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Product_Cat = structure(c(5L,
5L, 6L, 1L, 1L, 1L, 2L, 3L, 4L), .Label = c("C1", "D1", "E1",
"F1", "S1", "S2"), class = "factor"), Sale = c(10L, 15L, 5L,
20L, 15L, 10L, 5L, 5L, 5L), Profit = c(2L, 4L, 2L, 6L, 8L, 2L,
4L, 6L, 8L), Sale_Cum = c(10, 25, 5, 20, 35, 45, 5, 5, 5), Profit_Cum = c(2,
6, 2, 6, 14, 16, 4, 6, 8)), .Names = c("Product", "Product_Cat",
"Sale", "Profit", "Sale_Cum", "Profit_Cum"), row.names = c(NA,
-9L), class = "data.frame")`
We can iteratively slice the dataframe based on Product
and Product_Cat
, and for each iteration, assign the output produced by cumsum()
to Sale_Cum
and Product_Cum
:
cols <- c('Sale', 'Profit')
for (column in cols){
x[, paste0(column, '_Cum')] <- 0
for(p in unique(x$Product)){
for (pc in unique(x$Product_Cat)){
x[x$Product == p & x$Product_Cat == pc, paste0(column, '_Cum')] <- cumsum(x[x$Product == p & x$Product_Cat == pc, column])
}
}
}
print(x)
# Product Product_Cat Sale Profit Sale_Cum Profit_Cum
# 1 A S1 10 2 10 2
# 2 A S1 15 4 25 6
# 3 A S2 5 2 5 2
# 4 B C1 20 6 20 6
# 5 B C1 15 8 35 14
# 6 B C1 10 2 45 16
# 7 C D1 5 4 5 4
# 8 C E1 5 6 5 6
# 9 C F1 5 8 5 8
Upvotes: 1
Reputation: 2369
This doesn't use the group by per se but I think it achieves what you're looking for in that it is easily extensible to many columns:
D2 <- data.frame(lapply(Data[,c(3,4)], cumsum))
names(D2) <- gsub("$", "_cum", names(Data[,c(3,4)]))
Data <- cbind(Data, D2)
If you have 40+ columns just change the c(3,4) to include all the columns you're after.
EDIT:
I forgot that the OP wanted it to reset for each category. In that case, you can modify your original code:
DT <- setDT(Data)[,lapply(.SD, cumsum), by = .(Product,Product_Cat) ]
names(D2)[c(-1,-2)] <- gsub("$", "_cum", names(Data)[c(-1,-2)])
cbind(Data, D2[,c(-1,-2)])
Upvotes: 2