Reputation: 13

How to sum on different intervals to find multi year peaks

I am trying to find historical consecutive multi year sales peak of items. My problem is that some items were sold in the past and discontinued, but still need to be part of the analysis. For example:

I've worked through some for loops in r, however I am unsure how to tackle the issue of summing up the multiple consecutive years and also comparing it against other local maxima within the same dataset.

Year      Item            Sales
2001      Trash Can       100
2002      Trash Can       125
2003      Trash Can       90
2004      Trash Can       97
2002      Red Balloon     23
2003      Red Balloon     309
2004      Red Balloon     67
2005      Red Balloon     8
1998      Blue Bottle     600
1999      Blue Bottle     565

Based on the above data, if I wanted to calculate the 2 year peak of sales, I would want to output Blue Bottle 1165(sum of 1998 and 1999), Red Balloon 376(sum of 2003 and 2004) and Trash Can 225(sum of 2001 and 2002). However, if I wanted a 3 year peak, Blue bottle would be ineligible because it only has 2 years of data.

If I wanted to calculate the 3 year peak of sales, I would want to output Red Balloon 399(sum of 2002 to 2004) and Trash Can 315(sum of 2001 to 2003).

Upvotes: 1

Answers (4)

G. Grothendieck

Reputation: 269852

Read the data dat (shown reproducibly in the Note at the end) into a zoo series with one column per Item and then convert to a ts series tt (which will fill in the missing years with NA). Then use rollsumr to take the sums of every consecutive k years for each Item, find the maximum value for each Item, stack that into a data frame and omit any NA rows. The function Max is like max(x, na.rm = TRUE) except that if x is all NAs it returns NA instead of -Inf and does not issue a warning. stack outputs the item column second so reverse the columns using 2:1 and add nicer names.

library(zoo)

Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)

peak <- function(data, k) {
  tt <- as.ts(read.zoo(data, split = "Item"))
  s <- na.omit(stack(apply(rollsumr(tt, k), 2, Max)))
  setNames(s[2:1], c("Item", "Sum"))
}

peak(dat, 2)
##          Item  Sum
## 1 Blue Bottle 1165
## 2 Red Balloon  376
## 3   Trash Can  225

peak(dat, 3)
##          Item Sum
## 2 Red Balloon 399
## 3   Trash Can 315

Note

The input in reproducible form is assumed to be:

dat <- 
structure(list(Year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2003L, 
2004L, 2005L, 1998L, 1999L), Item = c("Trash Can", "Trash Can", 
"Trash Can", "Trash Can", "Red Balloon", "Red Balloon", "Red Balloon", 
"Red Balloon", "Blue Bottle", "Blue Bottle"), Sales = c(100L, 
125L, 90L, 97L, 23L, 309L, 67L, 8L, 600L, 565L)), row.names = c(NA, 
-10L), class = "data.frame")

Upvotes: 0

Anders Ellern Bilgrau

Reputation: 10233

A solution in R using the tidyverse and RcppRoll:

#Loading the packages and your data as a `tibble`
library("RcppRoll")
library("dplyr")

tbl <- tribble(
  ~Year,     ~Item,          ~Sales,
  2001,      "Trash Can",       100,
  2002,      "Trash Can",       125,
  2003,      "Trash Can",       90,
  2004,      "Trash Can",       97,
  2002,      "Red Balloon",     23,
  2003,      "Red Balloon",     309,
  2004,      "Red Balloon",      67,
  2005,      "Red Balloon",     8,
  1998,      "Blue Bottle",     600,
  1999,      "Blue Bottle",     565
)

# Set the number of consecutive years
n <- 2

# Compute the rolling sums (assumes data to be sorted) and take max
res <- tbl %>% 
  group_by(Item) %>% 
  mutate(rollingsum = roll_sumr(Sales, n)) %>% 
  summarize(best_sum = max(rollingsum, na.rm = TRUE))
print(res)
## A tibble: 3 x 2
#  Item        best_sum
#  <chr>          <dbl>
#1 Blue Bottle     1165
#2 Red Balloon      376
#3 Trash Can        225

Setting n <- 3 yields a different res:

print(res)
## A tibble: 3 x 2
#  Item        best_sum
#  <chr>          <dbl>
#1 Blue Bottle     -Inf
#2 Red Balloon      399
#3 Trash Can        315

Upvotes: 0

SNR

Reputation: 772

I only can help you with the SQL part; Use GROUP BY with HAVING. With HAVIG it will be filtered out all items without an specified minimum number of historical data-years.

Check if this query adjusts your requirements.

SELECT 
     item
     , count(*) as num_years
     , sum(Sales) as local_max 
from [your_table] 
where year between [year_ini] and [year_end]
group by item 
having count(*) >= [number_of_years]

Upvotes: 0

Gordon Linoff

Reputation: 1270391

In SQL, you can use window functions. For eligible 2 year sales:

    select item, sales, year
    from (select t.*,
                 sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
                 row_number() over (partition by item order by year) as seqnum
          from t
         ) t
    where seqnum >= 2;

And to get the peak:

select t.*   
from (select item, two_year_sales, year,
             max(two_year_sales) over (partition by item) as max_two_year_sales
      from (select t.*,
                   sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
                   row_number() over (partition by item order by year) as seqnum
            from t
           ) t
      where seqnum >= 2
     ) t
where two_year_sales = max_two_year_sales;

Upvotes: 0

How to sum on different intervals to find multi year peaks

Answers (4)

Note

Related Questions