Tim George
Tim George

Reputation: 11

Sales data aggregation in R

I have daily sales data for multiple products in three stores. It looks something like this:

item_id store_id category_id dept_id date event_name daiy_price
a tx_1 food 1 2012/12/24 6
a tx_1 food 1 2012/12/25 christmas 6
a tx_1 food 1 2012/12/26 6
a tx_1 food 1 2012/12/27 7
a tx_1 food 1 2012/12/28 7
b tx_1 food 1 2012/12/24 6
b tx_1 food 1 2012/12/25 christmas 6
b tx_1 food 1 2012/12/26 6
b tx_1 food 1 2012/12/27 7
b tx_1 food 1 2012/12/28 7

I want to round each week to the monday date using floor_date from lubridate package in R, then use group_by and summarise to compute weekly price average. However, i want each week to retain event_name if there is any during a particular week. Including event_name in group_by is not helpful since a day in a week may have an event but other days may not and they get grouped separately. Say each week starts on a monday and 2012/12/24 is a monday, how do i achieve this

item_id store_id category_id dept_id date event_name weekly_avg_price
a tx_1 food 1 2012/12/24 christmas 6.4
b tx_1 food 1 2012/12/24 christmas 6.4

Upvotes: 1

Views: 29

Answers (1)

Edward
Edward

Reputation: 19304

You can achieve your desired output for this toy example by filling in the empty event names and then summarising the weekly prices by event_name and taking the first of the other columns.

library(tidyverse)
library(lubridate)

mutate(df, week_start=floor_date(date, unit = "week", week_start = 1)) |> 
  group_by(item_id, store_id, category_id, dept_id) |>
  fill(event_name, .direction="downup") |>
  ungroup() |>
  summarise(store_id=first(store_id),
            category_id=first(category_id),
            event_name=first(event_name), 
            dept_id=first(dept_id),
            weekly_avg_price=mean(daiy_price, na.rm=TRUE), .by=c(item_id, week_start))

Gives:

# A tibble: 2 × 7
  item_id week_start store_id category_id event_name dept_id weekly_avg_price
  <chr>   <date>     <chr>    <chr>       <chr>        <dbl>            <dbl>
1 a       2012-12-24 tx_1     food        christmas        1              6.4
2 b       2012-12-24 tx_1     food        christmas        1              6.4

Data:

df <- structure(list(item_id = c("a", "a", "a", "a", "a", "b", "b", 
"b", "b", "b"), store_id = c("tx_1", "tx_1", "tx_1", "tx_1", 
"tx_1", "tx_1", "tx_1", "tx_1", "tx_1", "tx_1"), category_id = c("food", 
"food", "food", "food", "food", "food", "food", "food", "food", 
"food"), dept_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), date = structure(c(15698, 
15699, 15700, 15701, 15702, 15698, 15699, 15700, 15701, 15702
), class = "Date"), event_name = c(NA, "christmas", NA, NA, NA, 
NA, "christmas", NA, NA, NA), daiy_price = c(6, 6, 6, 7, 7, 6, 
6, 6, 7, 7)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

Upvotes: 1

Related Questions