Reputation: 279
I have daily measures of environmental data. For each ID, I want to create weekly averages up to the total number of weeks for that specific ID if the daily measure falls within that week (inclusive of Date_start). The last week will have varying lengths for each ID. Each weekly average will be a new column for that ID.
Some data:
O3 Date ID Date_start Date_end Total_weeks
1 21.1 1980-01-28 a 1980-01-22 1980-02-08 2.428571
2 27.3 1980-01-30 a 1980-01-22 1980-02-08 2.428571
3 23.8 1980-01-31 a 1980-01-22 1980-02-08 2.428571
4 29.5 1980-02-01 a 1980-01-22 1980-02-08 2.428571
5 23.8 1980-01-29 a 1980-01-22 1980-02-08 2.428571
6 27.1 1980-02-03 a 1980-01-22 1980-02-08 2.428571
7 31.6 1980-02-04 a 1980-01-22 1980-02-08 2.428571
8 25.8 1980-02-05 a 1980-01-22 1980-02-08 2.428571
9 31.2 1980-02-02 a 1980-01-22 1980-02-08 2.428571
10 14.0 1980-02-07 a 1980-01-22 1980-02-08 2.428571
11 19.1 1980-01-22 a 1980-01-22 1980-02-08 2.428571
12 15.5 1980-01-24 a 1980-01-22 1980-02-08 2.428571
13 15.6 1980-01-26 b 1980-01-26 1980-02-14. 2.714286
14 28.6 1980-01-27 b 1980-01-26 1980-02-14 2.714286
15 16.9 1980-02-04 b 1980-01-26 1980-02-14 2.714286
16 27.4 1980-02-05 b 1980-01-26 1980-02-14 2.714286
17 30.1 1980-02-06 b 1980-01-26 1980-02-14 2.714286
18 24.4 1980-02-10 b 1980-01-26 1980-02-14 2.714286
19 21.2 1980-01-30 b 1980-01-26 1980-02-14 2.714286
20 22.1 1980-02-11 b 1980-01-26 1980-02-14 2.714286
21 26.1 1980-02-13 b 1980-01-26 1980-02-14 2.714286
21 19.9 1980-02-14 b 1980-01-26 1980-02-14 2.714286
For example, ID "a" will have the following weekly averages:
Week 1: mean(O3) for 1980-01-22 <= Date < 1980-01-29
Week 2: mean(O3) for 1980-01-29 <= Date < 1980-02-05
Week 3: mean(O3) for 1980-02-05 <= Date < 1980-02-08
where Week 3 wouldn't be a full week and 1980-02-08 is the Date_end.
I'm trying to write a nested loop that loops through each unique ID in the dataframe, calculates a sequence of numbers depending on Total_weeks, and then loops through the sequence to calculate the weekly averages.
What I was thinking:
for (i in unique(ID)) {
sequence[i] <- seq(from = 1, to = unique(Total_weeks[i]), by = 7)
for (ii in 1:length(sequence[i])) {
week[ii] <- mean(O3[Date >= Date_start + first_number_in_sequence & Date < Date_start + next_number_in_sequence[i]])
}
}
But I'm not sure how to write the code so that the weekly averages are calculated according to the correct number in the sequence. I also attempted to do this in dplyr so that I could group by the ID but am not sure how to create a different number of columns for each unique ID (since they have varying Total_weeks).
Any insight would be appreciated.
Upvotes: 1
Views: 302
Reputation: 30474
Using tidyverse
you can try the following.
First, group_by(ID)
and then determine the week number of each row's date based on the start date.
Then, group_by
both ID and the week number and calculate the mean O3
for each week. If you have missing data, you may consider na.rm = TRUE
.
Finally, if you want columns for each week, and one row per ID
, consider using pivot_wider
.
Note that this ignores Date_end
(assumes you are not interested in weeks with no data).
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(week = floor(difftime(Date, Date_start, units = "weeks")) + 1) %>%
group_by(ID, week) %>%
summarise(weekly_mean = mean(O3)) %>%
pivot_wider(id_cols = ID, names_from = "week", values_from = "weekly_mean", names_prefix = "week_")
I believe a data.table
equivalent might be something like this:
library(data.table)
setDT(df)
dcast(df[, week := floor(difftime(Date, Date_start, units = "weeks")) + 1, by = .(ID)][,
.(weekly_mean = mean(O3)), by = .(ID, week)], ID ~ paste("week", week, sep = "_"), value.var = "weekly_mean")
Output
ID week_1 week_2 week_3
<chr> <dbl> <dbl> <dbl>
1 a 18.6 27.8 19.9
2 b 21.8 24.8 23.1
Data
df <- structure(list(O3 = c(21.1, 27.3, 23.8, 29.5, 23.8, 27.1, 31.6,
25.8, 31.2, 14, 19.1, 15.5, 15.6, 28.6, 16.9, 27.4, 30.1, 24.4,
21.2, 22.1, 26.1, 19.9), Date = structure(c(3679, 3681, 3682,
3683, 3680, 3685, 3686, 3687, 3684, 3689, 3673, 3675, 3677, 3678,
3686, 3687, 3688, 3692, 3681, 3693, 3695, 3696), class = "Date"),
ID = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"
), Date_start = structure(c(3673, 3673, 3673, 3673, 3673,
3673, 3673, 3673, 3673, 3673, 3673, 3673, 3677, 3677, 3677,
3677, 3677, 3677, 3677, 3677, 3677, 3677), class = "Date"),
Date_end = structure(c(3690, 3690, 3690, 3690, 3690, 3690,
3690, 3690, 3690, 3690, 3690, 3690, 3696, 3696, 3696, 3696,
3696, 3696, 3696, 3696, 3696, 3696), class = "Date"), Total_weeks = c(2.428571,
2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.428571,
2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.714286,
2.714286, 2.714286, 2.714286, 2.714286, 2.714286, 2.714286,
2.714286, 2.714286, 2.714286)), row.names = c(NA, -22L), class = "data.frame")
Upvotes: 1