Reputation: 11
I have a data frame with the following:
1) Store
2) DayOfWeek
3) Date
4) Sales
5) Customers
6) Open
7) Promo
8) StateHoliday
9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance
13) CompetitionOpenSinceMonth
14) CompetitionOpenSinceYear
15) Promo2
16) Promo2SinceWeek
17) Promo2SinceYear
18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate
21) DaysSinceCompetionOpen
I am trying to calculate the Average Sales for the Previous Quarter based on the date (basically date - 3 months). But, I need to also subset based on DayOfWeek and Promo. I have written a function and am using mapply.
quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{
storeDate = as.Date(storeDate,"%Y-%m-%d")
EndDate = ymd(as.Date(storeDate)) + ddays(-1)
EndDate = as.Date(storeDate,"%Y-%m-%d")
StartDate = ymd(storeDate + months(-3))
StartDate = as.Date(StartDate)
quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
quarterSales = 0
salesDf <- ddply(quarterStoresales,.(Store),summarize,avgSales=mean(Sales))
if (nrow(salesDf)>0)
quarterSales = as.numeric(round(salesDf$avgSales,digits=0))
return(quarterSales)
}
saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)
head(exampleset)
Store DayOfWeek Date Sales Promo
186 1 3 2013-06-05 5012 1
296 1 3 2013-04-10 4903 1
337 1 3 2013-05-29 5784 1
425 1 3 2013-05-08 5230 0
449 1 3 2013-04-03 4625 0
477 1 3 2013-03-27 6660 1
saleswithstore is a dataframe that has 1,000,000 rows. So, this solution is not workable because it performing badly and taking forever. Is there a better, more efficient way to have a specific subset on a dataframe like this and then and then take an average like I am trying to do here?
I am open to any suggestions. I admittedly am new to R.
Upvotes: 0
Views: 594
Reputation: 3121
@maubin0316, your intuition is right in the comment that you can just group by the rest of the variables. I put together this example using data.table
library(data.table)
set.seed(343)
# Create sample data
dt <- data.table('Store' = sample(1:10, 100, replace=T),
'DayOfWeek' = sample(1:7, 100, replace=T),
'Date' = sample(as.Date('2013-01-01'):as.Date('2013-06-30'), 100, replace=T),
'Sales' = sample(1000:10000, 100),
'Promo' = sample(c(0,1), 10, replace=T))
QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')
# Function to calculate your quarterly sales
QuarterlySales <- function(startDate, endDate, data){
# Limit between your dates, group by your variables of interest
data <- data[between(Date,startDate,endDate),list(TotalSales=sum(Sales)), by=list(Store,DayOfWeek,Promo)]
# Sort in an order that makes sense
data <- data[order(Store, DayOfWeek, Promo)]
return(data)
}
salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary
Upvotes: 1