Reputation: 83

How do I calculate the averages of values per date and category and display them lagged?

I have a table with non-consecutive dates, a categorical variable and a variable (price). I want the average price per category per day. But where it gets tricky is I want that average price to be the average price of the previous date that the category had entries.

Date	category	Price
10-12-2024	Red	0.9
10-12-2024	Red	0.92
10-12-2024	Blue	1.23
10-12-2024	Blue	1.12
10-17-2024	Blue	0.93
10-17-2024	Blue	1.14
10-19-2024	Red	0.99
10-19-2024	Blue	1.31

So I would like a 4th column of average price previous trading day for that category. For example, row 5 and 6 would have an entry of 1.175 (average of [3:4, 3]) for this new column. Row 7 would have an entry of 0.91 (average of [1:2, 3]). Row 8 would have an entry of 1.035 (average of [5:6, 3]).

Date	category	Price	Average price previous trading day
10-12-2024	Red	0.9
10-12-2024	Red	0.92
10-12-2024	Blue	1.23
10-12-2024	Blue	1.12
10-17-2024	Blue	0.93	1.175
10-17-2024	Blue	1.14	1.175
10-19-2024	Red	0.99	0.91
10-19-2024	Blue	1.31	1.035

I am a beginner coder so my attempts were limited to the unique() function and which(). I had no progress.

Upvotes: 8

Answers (4)

Friede

Reputation: 7979

Might be a bit too particular to the toy data, haven't tested. With base R:

shift = \(x, k = 1, fill = NA) c(rep(fill, k), head(x, -k))

with(X, by(Price, list(Date, category), FUN = mean)) |> 
  array2DF() |> 
  setNames(c('Date', 'category', 'avg_price')) |>
  na.omit() |> 
  sort_by(~Date) |>
  transform(avg_price = ave(avg_price, category, FUN = shift)) |>
  merge(X, y = _, by = c("Date", "category"))

        Date category Price avg_price
1 10-12-2024     Blue  1.23        NA
2 10-12-2024     Blue  1.12        NA
3 10-12-2024      Red  0.90        NA
4 10-12-2024      Red  0.92        NA
5 10-17-2024     Blue  0.93     1.175
6 10-17-2024     Blue  1.14     1.175
7 10-19-2024     Blue  1.31     1.035
8 10-19-2024      Red  0.99     0.910

where X is

X = data.frame(
  Date = rep(c("10-12-2024", "10-17-2024", "10-19-2024"), c(4L, 2L, 2L)),
  category = c("Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red", "Blue"),
  Price = c(0.9, 0.92, 1.23, 1.12, 0.93, 1.14, 0.99, 1.31)
)

Upvotes: 2

G. Grothendieck

Reputation: 270045

Grouping by category, calculate the dense rank which gives 1 for all the first dates, 2 for the second and so on. Then take the average of the prices of the rows whose dense rank is one less than the current row. Finally remove the rank temporary variable. The coalesce is to convert NaN to NA. It can be omitted if NaN is ok. Note that this is strictly left to right so that if we need to change the name of DF it only has to be done in one place and that is at the start of the pipeline. It is also reasonably compact consisting of a single mutate.

library(dplyr)
library(purrr)

DF %>%
  mutate(rank = dense_rank(as.Date(Date, "%m-%d-%Y")), 
     avg_price = coalesce(map_dbl(rank, ~ mean(Price[.x == rank + 1]))),
     rank = NULL, .by = category)

giving

        Date category Price avg_price
1 10-12-2024      Red  0.90        NA
2 10-12-2024      Red  0.92        NA
3 10-12-2024     Blue  1.23        NA
4 10-12-2024     Blue  1.12        NA
5 10-17-2024     Blue  0.93     1.175
6 10-17-2024     Blue  1.14     1.175
7 10-19-2024      Red  0.99     0.910
8 10-19-2024     Blue  1.31     1.035

Note

The input data in reproducible form:

DF <- data.frame(
  Date = rep(c("10-12-2024", "10-17-2024", "10-19-2024"), c(4L, 2L, 2L)),
  category = c("Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red", "Blue"),
  Price = c(0.9, 0.92, 1.23, 1.12, 0.93, 1.14, 0.99, 1.31)
)

Upvotes: 6

M--

Reputation: 29153

This is similar to the other answer, but avoids ..._join which can cause higher computational cost for larger datasets;

library(dplyr)

df %>% 
  summarise(avg = mean(Price),
            Price = list(Price), 
            .by = c(Date, category)) %>% 
  mutate(`Average price previous trading day` = lag(avg), 
         .by = category, .keep = "unused") %>% 
  tidyr::unnest(Price)

#> # A tibble: 8 × 4
#>   Date       category Price `Average price previous trading day`
#>   <chr>      <chr>    <dbl>                                <dbl>
#> 1 10-12-2024 Red       0.9                                 NA   
#> 2 10-12-2024 Red       0.92                                NA   
#> 3 10-12-2024 Blue      1.23                                NA   
#> 4 10-12-2024 Blue      1.12                                NA   
#> 5 10-17-2024 Blue      0.93                                 1.18
#> 6 10-17-2024 Blue      1.14                                 1.18
#> 7 10-19-2024 Red       0.99                                 0.91
#> 8 10-19-2024 Blue      1.31                                 1.03

^{Created on 2025-01-14 with reprex v2.0.2}

Upvotes: 5

jpsmith

Reputation: 17656

There may be more elegant ways, but one approach is to summarize each day, the use lag to grab the previous day, and *_join back with your original data:

library(dplyr)

df %>%
  summarise(avg_price = mean(Price), 
            .by = c(Date, category)) %>%
  mutate(prev_average = lag(avg_price), .by = category) %>%
  right_join(df)

Output:

#         Date category avg_price prev_average Price
# 1 10-12-2024      Red     0.910           NA  0.90
# 2 10-12-2024      Red     0.910           NA  0.92
# 3 10-12-2024     Blue     1.175           NA  1.23
# 4 10-12-2024     Blue     1.175           NA  1.12
# 5 10-17-2024     Blue     1.035        1.175  0.93
# 6 10-17-2024     Blue     1.035        1.175  1.14
# 7 10-19-2024      Red     0.990        0.910  0.99
# 8 10-19-2024     Blue     1.310        1.035  1.31

Note the I kept the intermediate variable avg_price (which reflects a give day's average price) in there for clarity, but can be removed by adding in ... %>% select(-arg_price) at the end. Data:

df <- read.table(text = "Date   category    Price
10-12-2024  Red 0.9
10-12-2024  Red 0.92
10-12-2024  Blue    1.23
10-12-2024  Blue    1.12
10-17-2024  Blue    0.93
10-17-2024  Blue    1.14
10-19-2024  Red 0.99
10-19-2024  Blue    1.31", header = TRUE)

Upvotes: 4

How do I calculate the averages of values per date and category and display them lagged?

Answers (4)

Note

Related Questions