maubin0316
maubin0316

Reputation: 11

Alternatives to MAPPLY in R

I have a data frame with the following:

1) Store 2) DayOfWeek
3) Date 4) Sales
5) Customers
6) Open 7) Promo
8) StateHoliday 9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance 13) CompetitionOpenSinceMonth
14) CompetitionOpenSinceYear
15) Promo2
16) Promo2SinceWeek 17) Promo2SinceYear 18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate
21) DaysSinceCompetionOpen

I am trying to calculate the Average Sales for the Previous Quarter based on the date (basically date - 3 months). But, I need to also subset based on DayOfWeek and Promo. I have written a function and am using mapply.

quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{   
    storeDate = as.Date(storeDate,"%Y-%m-%d")
    EndDate = ymd(as.Date(storeDate)) + ddays(-1)
    EndDate = as.Date(storeDate,"%Y-%m-%d")
    StartDate = ymd(storeDate + months(-3))
    StartDate = as.Date(StartDate)

    quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
    quarterSales = 0
    salesDf <- ddply(quarterStoresales,.(Store),summarize,avgSales=mean(Sales))  

    if (nrow(salesDf)>0)
      quarterSales = as.numeric(round(salesDf$avgSales,digits=0))     

    return(quarterSales)
}

saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)

 head(exampleset)
           Store         DayOfWeek Date               Sales           Promo
186            1                3  2013-06-05         5012            1
296            1                3  2013-04-10         4903            1
337            1                3  2013-05-29         5784            1
425            1                3  2013-05-08         5230            0
449            1                3  2013-04-03         4625            0
477            1                3  2013-03-27         6660            1

saleswithstore is a dataframe that has 1,000,000 rows. So, this solution is not workable because it performing badly and taking forever. Is there a better, more efficient way to have a specific subset on a dataframe like this and then and then take an average like I am trying to do here?

I am open to any suggestions. I admittedly am new to R.

Upvotes: 0

Views: 594

Answers (1)

Tchotchke
Tchotchke

Reputation: 3121

@maubin0316, your intuition is right in the comment that you can just group by the rest of the variables. I put together this example using data.table

library(data.table)
set.seed(343)

# Create sample data
dt <- data.table('Store' = sample(1:10, 100, replace=T),
                 'DayOfWeek' = sample(1:7, 100, replace=T),
                 'Date' = sample(as.Date('2013-01-01'):as.Date('2013-06-30'), 100, replace=T),
                 'Sales' = sample(1000:10000, 100),
                 'Promo' = sample(c(0,1), 10, replace=T))

QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')

# Function to calculate your quarterly sales
QuarterlySales <- function(startDate, endDate, data){
  # Limit between your dates, group by your variables of interest
  data <- data[between(Date,startDate,endDate),list(TotalSales=sum(Sales)), by=list(Store,DayOfWeek,Promo)]
  # Sort in an order that makes sense
  data <- data[order(Store, DayOfWeek, Promo)]
  return(data)
}

salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary

Upvotes: 1

Related Questions