Reputation: 13
I am trying to find historical consecutive multi year sales peak of items. My problem is that some items were sold in the past and discontinued, but still need to be part of the analysis. For example:
I've worked through some for loops in r, however I am unsure how to tackle the issue of summing up the multiple consecutive years and also comparing it against other local maxima within the same dataset.
Year Item Sales
2001 Trash Can 100
2002 Trash Can 125
2003 Trash Can 90
2004 Trash Can 97
2002 Red Balloon 23
2003 Red Balloon 309
2004 Red Balloon 67
2005 Red Balloon 8
1998 Blue Bottle 600
1999 Blue Bottle 565
Based on the above data, if I wanted to calculate the 2 year peak of sales, I would want to output Blue Bottle 1165(sum of 1998 and 1999), Red Balloon 376(sum of 2003 and 2004) and Trash Can 225(sum of 2001 and 2002). However, if I wanted a 3 year peak, Blue bottle would be ineligible because it only has 2 years of data.
If I wanted to calculate the 3 year peak of sales, I would want to output Red Balloon 399(sum of 2002 to 2004) and Trash Can 315(sum of 2001 to 2003).
Upvotes: 1
Views: 100
Reputation: 269852
Read the data dat
(shown reproducibly in the Note at the end) into a zoo series with one column per Item
and then convert to a ts series tt
(which will fill in the missing years with NA). Then use rollsumr
to take the sums of every consecutive k
years for each Item
, find the maximum value for each Item
, stack that into a data frame and omit any NA rows. The function Max
is like max(x, na.rm = TRUE)
except that if x is all NAs it returns NA instead of -Inf and does not issue a warning. stack
outputs the item column second so reverse the columns using 2:1 and add nicer names.
library(zoo)
Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
peak <- function(data, k) {
tt <- as.ts(read.zoo(data, split = "Item"))
s <- na.omit(stack(apply(rollsumr(tt, k), 2, Max)))
setNames(s[2:1], c("Item", "Sum"))
}
peak(dat, 2)
## Item Sum
## 1 Blue Bottle 1165
## 2 Red Balloon 376
## 3 Trash Can 225
peak(dat, 3)
## Item Sum
## 2 Red Balloon 399
## 3 Trash Can 315
The input in reproducible form is assumed to be:
dat <-
structure(list(Year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2003L,
2004L, 2005L, 1998L, 1999L), Item = c("Trash Can", "Trash Can",
"Trash Can", "Trash Can", "Red Balloon", "Red Balloon", "Red Balloon",
"Red Balloon", "Blue Bottle", "Blue Bottle"), Sales = c(100L,
125L, 90L, 97L, 23L, 309L, 67L, 8L, 600L, 565L)), row.names = c(NA,
-10L), class = "data.frame")
Upvotes: 0
Reputation: 10233
A solution in R using the tidyverse
and RcppRoll
:
#Loading the packages and your data as a `tibble`
library("RcppRoll")
library("dplyr")
tbl <- tribble(
~Year, ~Item, ~Sales,
2001, "Trash Can", 100,
2002, "Trash Can", 125,
2003, "Trash Can", 90,
2004, "Trash Can", 97,
2002, "Red Balloon", 23,
2003, "Red Balloon", 309,
2004, "Red Balloon", 67,
2005, "Red Balloon", 8,
1998, "Blue Bottle", 600,
1999, "Blue Bottle", 565
)
# Set the number of consecutive years
n <- 2
# Compute the rolling sums (assumes data to be sorted) and take max
res <- tbl %>%
group_by(Item) %>%
mutate(rollingsum = roll_sumr(Sales, n)) %>%
summarize(best_sum = max(rollingsum, na.rm = TRUE))
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle 1165
#2 Red Balloon 376
#3 Trash Can 225
Setting n <- 3
yields a different res
:
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle -Inf
#2 Red Balloon 399
#3 Trash Can 315
Upvotes: 0
Reputation: 772
I only can help you with the SQL
part; Use GROUP BY
with HAVING
. With HAVIG
it will be filtered out all items without an specified minimum number of historical data-years.
Check if this query adjusts your requirements.
SELECT
item
, count(*) as num_years
, sum(Sales) as local_max
from [your_table]
where year between [year_ini] and [year_end]
group by item
having count(*) >= [number_of_years]
Upvotes: 0
Reputation: 1270391
In SQL, you can use window functions. For eligible 2 year sales:
select item, sales, year
from (select t.*,
sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
row_number() over (partition by item order by year) as seqnum
from t
) t
where seqnum >= 2;
And to get the peak:
select t.*
from (select item, two_year_sales, year,
max(two_year_sales) over (partition by item) as max_two_year_sales
from (select t.*,
sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
row_number() over (partition by item order by year) as seqnum
from t
) t
where seqnum >= 2
) t
where two_year_sales = max_two_year_sales;
Upvotes: 0