Reputation: 83
I have a table with non-consecutive dates, a categorical variable and a variable (price). I want the average price per category per day. But where it gets tricky is I want that average price to be the average price of the previous date that the category had entries.
Date | category | Price |
---|---|---|
10-12-2024 | Red | 0.9 |
10-12-2024 | Red | 0.92 |
10-12-2024 | Blue | 1.23 |
10-12-2024 | Blue | 1.12 |
10-17-2024 | Blue | 0.93 |
10-17-2024 | Blue | 1.14 |
10-19-2024 | Red | 0.99 |
10-19-2024 | Blue | 1.31 |
So I would like a 4th column of average price previous trading day
for that category.
For example, row 5 and 6 would have an entry of 1.175 (average of [3:4, 3]) for this new column.
Row 7 would have an entry of 0.91 (average of [1:2, 3]).
Row 8 would have an entry of 1.035 (average of [5:6, 3]).
Date | category | Price | Average price previous trading day |
---|---|---|---|
10-12-2024 | Red | 0.9 | |
10-12-2024 | Red | 0.92 | |
10-12-2024 | Blue | 1.23 | |
10-12-2024 | Blue | 1.12 | |
10-17-2024 | Blue | 0.93 | 1.175 |
10-17-2024 | Blue | 1.14 | 1.175 |
10-19-2024 | Red | 0.99 | 0.91 |
10-19-2024 | Blue | 1.31 | 1.035 |
I am a beginner coder so my attempts were limited to the unique() function and which(). I had no progress.
Upvotes: 8
Views: 417
Reputation: 7979
Might be a bit too particular to the toy data, haven't tested. With base R:
shift = \(x, k = 1, fill = NA) c(rep(fill, k), head(x, -k))
with(X, by(Price, list(Date, category), FUN = mean)) |>
array2DF() |>
setNames(c('Date', 'category', 'avg_price')) |>
na.omit() |>
sort_by(~Date) |>
transform(avg_price = ave(avg_price, category, FUN = shift)) |>
merge(X, y = _, by = c("Date", "category"))
Date category Price avg_price
1 10-12-2024 Blue 1.23 NA
2 10-12-2024 Blue 1.12 NA
3 10-12-2024 Red 0.90 NA
4 10-12-2024 Red 0.92 NA
5 10-17-2024 Blue 0.93 1.175
6 10-17-2024 Blue 1.14 1.175
7 10-19-2024 Blue 1.31 1.035
8 10-19-2024 Red 0.99 0.910
where X
is
X = data.frame(
Date = rep(c("10-12-2024", "10-17-2024", "10-19-2024"), c(4L, 2L, 2L)),
category = c("Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red", "Blue"),
Price = c(0.9, 0.92, 1.23, 1.12, 0.93, 1.14, 0.99, 1.31)
)
Upvotes: 2
Reputation: 270045
Grouping by category, calculate the dense rank which gives 1 for all the first dates, 2 for the second and so on. Then take the average of the prices of the rows whose dense rank is one less than the current row. Finally remove the rank
temporary variable. The coalesce
is to convert NaN to NA. It can be omitted if NaN is ok. Note that this is strictly left to right so that if we need to change the name of DF
it only has to be done in one place and that is at the start of the pipeline. It is also reasonably compact consisting of a single mutate
.
library(dplyr)
library(purrr)
DF %>%
mutate(rank = dense_rank(as.Date(Date, "%m-%d-%Y")),
avg_price = coalesce(map_dbl(rank, ~ mean(Price[.x == rank + 1]))),
rank = NULL, .by = category)
giving
Date category Price avg_price
1 10-12-2024 Red 0.90 NA
2 10-12-2024 Red 0.92 NA
3 10-12-2024 Blue 1.23 NA
4 10-12-2024 Blue 1.12 NA
5 10-17-2024 Blue 0.93 1.175
6 10-17-2024 Blue 1.14 1.175
7 10-19-2024 Red 0.99 0.910
8 10-19-2024 Blue 1.31 1.035
The input data in reproducible form:
DF <- data.frame(
Date = rep(c("10-12-2024", "10-17-2024", "10-19-2024"), c(4L, 2L, 2L)),
category = c("Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red", "Blue"),
Price = c(0.9, 0.92, 1.23, 1.12, 0.93, 1.14, 0.99, 1.31)
)
Upvotes: 6
Reputation: 29153
This is similar to the other answer, but avoids ..._join
which can cause higher computational cost for larger datasets;
library(dplyr)
df %>%
summarise(avg = mean(Price),
Price = list(Price),
.by = c(Date, category)) %>%
mutate(`Average price previous trading day` = lag(avg),
.by = category, .keep = "unused") %>%
tidyr::unnest(Price)
#> # A tibble: 8 × 4
#> Date category Price `Average price previous trading day`
#> <chr> <chr> <dbl> <dbl>
#> 1 10-12-2024 Red 0.9 NA
#> 2 10-12-2024 Red 0.92 NA
#> 3 10-12-2024 Blue 1.23 NA
#> 4 10-12-2024 Blue 1.12 NA
#> 5 10-17-2024 Blue 0.93 1.18
#> 6 10-17-2024 Blue 1.14 1.18
#> 7 10-19-2024 Red 0.99 0.91
#> 8 10-19-2024 Blue 1.31 1.03
Created on 2025-01-14 with reprex v2.0.2
Upvotes: 5
Reputation: 17656
There may be more elegant ways, but one approach is to summarize
each day, the use lag
to grab the previous day, and *_join
back with your original data:
library(dplyr)
df %>%
summarise(avg_price = mean(Price),
.by = c(Date, category)) %>%
mutate(prev_average = lag(avg_price), .by = category) %>%
right_join(df)
Output:
# Date category avg_price prev_average Price
# 1 10-12-2024 Red 0.910 NA 0.90
# 2 10-12-2024 Red 0.910 NA 0.92
# 3 10-12-2024 Blue 1.175 NA 1.23
# 4 10-12-2024 Blue 1.175 NA 1.12
# 5 10-17-2024 Blue 1.035 1.175 0.93
# 6 10-17-2024 Blue 1.035 1.175 1.14
# 7 10-19-2024 Red 0.990 0.910 0.99
# 8 10-19-2024 Blue 1.310 1.035 1.31
Note the I kept the intermediate variable avg_price
(which reflects a give day's average price) in there for clarity, but can be removed by adding in ... %>% select(-arg_price)
at the end.
Data:
df <- read.table(text = "Date category Price
10-12-2024 Red 0.9
10-12-2024 Red 0.92
10-12-2024 Blue 1.23
10-12-2024 Blue 1.12
10-17-2024 Blue 0.93
10-17-2024 Blue 1.14
10-19-2024 Red 0.99
10-19-2024 Blue 1.31", header = TRUE)
Upvotes: 4