Samet Sökel

Reputation: 2670

Creating a summary table by dividing values by the average of the surrounding observations in data.table

A minimal sample of my data is:

library(data.table)

set.seed(1)

tbl <- data.table(store = c('A','A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C'),
                  year = c(2017,2017,2017,2017,2018,2018,2018,2017,2017,2017,2017,2017,2017,2017,2017,2017,2018),
                  week = c(12,13,15,16,2,3,4,18,19,20,22,24,25,1,2,3,2),
                  insert = sample(0:1,size = 17,replace = T),
                  demand = sample(200:250,size = 17,replace=T))

I'd like to create a summary table which is calculated in the following way:

To calculate the effect where the insert column is 1, I must divide the demand value of that row by the average of the demand values of the nearest surrounding weeks within the same year whose insert value is 0.

For example, if I need to calculate it for store A in 2017 at the 3rd week, it must be:

rate <- 209 / mean(c(206,245))

But if the year has more than one insert, like store B in 2017, it should be the mean of the rates (for weeks 19 and 20):

rate <- mean(224 / mean(c(240,236)), 245 / mean(c(240,236)))

And if I can't find two values around it, I have to calculate the ratio by dividing by the only value I do find. For example, store C in 2017:

rate <- 243 / 224

If I can find no value around the inserted row, I need to pass 1.
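In other words, for a single insert row the rule boils down to something like the sketch below (rate_one is only meant to spell the rule out, not the grouped solution I'm looking for; the 230 in the last call is just an arbitrary demand value):

rate_one <- function(insert_demand, surrounding_demand) {
  surrounding_demand <- surrounding_demand[!is.na(surrounding_demand)]
  if (length(surrounding_demand) == 0) return(1)  # nothing around the insert: pass 1
  insert_demand / mean(surrounding_demand)
}

rate_one(209, c(206, 245))  # store A, 2017 example: ~0.927
rate_one(243, 224)          # store C, 2017 example: ~1.08
rate_one(230, numeric(0))   # no surrounding value at all: 1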

Finally, the summary table should look like this:

desired_tbl <- data.table(store = c('A','A','B','C','C'),
                          year = c(2017,2018,2017,2017,2018),
                          rate = c(0.927,0.97,0.941,1.08,1))

desired_tbl

  store  year  rate
  <chr> <dbl> <dbl>
1 A      2017 0.927
2 A      2018 0.97 
3 B      2017 0.941
4 C      2017 1.08 
5 C      2018 1    

I can do all of this by writing for loops and tons of conditions, but I'm looking for a vectorised way to accomplish it. dplyr solutions are also welcome.

Thanks in advance.

Upvotes: 0

Views: 75

Answers (2)

G. Grothendieck

Reputation: 270045

Replace the demand values having insert values of 1 with NA, giving d, and then fill in d forwards and backwards by group, taking the mean of the two directions and dividing it into demand, giving rate. Then keep only the insert 1 rows and aggregate rate by store/year group using mean on what is left. If any rate is not finite, use 1.

library(dplyr)
library(zoo)

tbl %>%
  mutate(d = ifelse(insert == 1, NA, demand)) %>%
  group_by(store, year) %>%
  mutate(rate = demand / 
    rowMeans(cbind(na.locf0(d), na.locf0(d, fromLast = TRUE)), na.rm = TRUE)) %>%
  filter(insert == 1) %>%
  summarize(rate = mean(rate, na.rm = TRUE), .groups = "drop") %>%
  mutate(rate = ifelse(is.finite(rate), rate, 1))

giving:

# A tibble: 5 x 3
  store  year  rate
  <chr> <dbl> <dbl>
1 A      2017 0.927
2 A      2018 0.973
3 B      2017 0.985
4 C      2017 1.08 
5 C      2018 1    
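For reference, the rowMeans line works because na.locf0 fills the NA left by an insert row from each direction; a small standalone illustration using the 206/245 pair from the store A 2017 example in the question:

library(zoo)

d <- c(206, NA, 245)          # demand with the insert row blanked out
na.locf0(d)                   # 206 206 245  (carry the previous value forward)
na.locf0(d, fromLast = TRUE)  # 206 245 245  (carry the next value backward)
rowMeans(cbind(na.locf0(d), na.locf0(d, fromLast = TRUE)), na.rm = TRUE)
# 206.0 225.5 245.0           (225.5 is the mean of the two surrounding weeks)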

Upvotes: 1

langtang

Reputation: 24845

Here is one possible way:

  • First, identify the store-years where the insert=1 rows are consecutive, and collapse them (i.e., take their average). In your example, there is only one like this (2017-B, weeks 19/20). Take these collapsed rows and bind them back to the insert=0 rows, reordering by store, year, week:
tbl[, `:=`(demand=as.double(demand), insert_id=rleid(insert))]
tbl <- rbind(
  tbl[insert==0],
  tbl[insert==1][, lapply(.SD, mean, na.rm=T), by=.(store,year,insert_id)]
)[order(store,year,week)]
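Here rleid() just numbers consecutive runs of insert, so back-to-back insert=1 weeks share an insert_id and are averaged together; a quick standalone illustration:

rleid(c(0, 1, 1, 0, 1))
# [1] 1 2 2 3 4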
  • Next, get the values before and after the insert=1 rows, compute the mean (m) of those values by row, take the ratio of demand to m, and retain the insert=1 rows (see the note on shift() after the output):
tbl[, c("v1","v2"):=shift(demand,c(-1,1)), by=.(store, year)]
tbl[, m:=mean(c(v1,v2), na.rm=T), by=1:nrow(tbl)]
tbl[, rate:=demand/m][insert==1,.(rate=mean(fifelse(is.na(rate),1,rate))), by=.(store,year)]

Output:

    store  year      rate
   <char> <num>     <num>
1:      A  2017 0.9268293
2:      A  2018 0.9727273
3:      B  2017 0.9852941
4:      C  2017 1.0848214
5:      C  2018 1.0000000

It's not identical to your desired output, because I believe you have miscalculated 2017-B.
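As a side note on the mechanics, shift(demand, c(-1, 1)) returns a list holding the lead and the lag of demand within each store/year, which is why v1 and v2 can be assigned in one step; a quick illustration with three arbitrary demand values:

shift(c(240, 224, 236), c(-1, 1))
# [[1]] 224 236  NA   (demand of the following row)
# [[2]]  NA 240 224   (demand of the preceding row)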

Upvotes: 1
