user12310746
user12310746

Reputation: 279

Data.table solution to calculate weekly average up to varying total number of weeks if date falls within a date interval

I have daily measures of environmental data. For each ID, I want to create weekly averages up to the total number of weeks for that specific ID if the daily measure falls within that week (inclusive of Date_start). The last week will have varying lengths for each ID. Each weekly average will be a new column for that ID.

Some data:

   O3   Date       ID Date_start Date_end  Total_weeks 
 1 21.1 1980-01-28 a 1980-01-22 1980-02-08  2.428571       
 2 27.3 1980-01-30 a 1980-01-22 1980-02-08  2.428571       
 3 23.8 1980-01-31 a 1980-01-22 1980-02-08  2.428571       
 4 29.5 1980-02-01 a 1980-01-22 1980-02-08  2.428571   
 5 23.8 1980-01-29 a 1980-01-22 1980-02-08  2.428571      
 6 27.1 1980-02-03 a 1980-01-22 1980-02-08  2.428571     
 7 31.6 1980-02-04 a 1980-01-22 1980-02-08  2.428571    
 8 25.8 1980-02-05 a 1980-01-22 1980-02-08  2.428571    
 9 31.2 1980-02-02 a 1980-01-22 1980-02-08  2.428571   
10 14.0 1980-02-07 a 1980-01-22 1980-02-08  2.428571  
11 19.1 1980-01-22 a 1980-01-22 1980-02-08  2.428571 
12 15.5 1980-01-24 a 1980-01-22 1980-02-08  2.428571  
 
13 15.6 1980-01-26 b 1980-01-26 1980-02-14. 2.714286
14 28.6 1980-01-27 b 1980-01-26 1980-02-14  2.714286
15 16.9 1980-02-04 b 1980-01-26 1980-02-14  2.714286
16 27.4 1980-02-05 b 1980-01-26 1980-02-14  2.714286
17 30.1 1980-02-06 b 1980-01-26 1980-02-14  2.714286
18 24.4 1980-02-10 b 1980-01-26 1980-02-14  2.714286
19 21.2 1980-01-30 b 1980-01-26 1980-02-14  2.714286
20 22.1 1980-02-11 b 1980-01-26 1980-02-14  2.714286
21 26.1 1980-02-13 b 1980-01-26 1980-02-14  2.714286
21 19.9 1980-02-14 b 1980-01-26 1980-02-14  2.714286


For example, ID "a" will have the following weekly averages:

Week 1: mean(O3) for 1980-01-22 <= Date < 1980-01-29
Week 2: mean(O3) for 1980-01-29 <= Date < 1980-02-05
Week 3: mean(O3) for 1980-02-05 <= Date < 1980-02-08

where Week 3 wouldn't be a full week and 1980-02-08 is the Date_end.

I'm trying to write a nested loop that loops through each unique ID in the dataframe, calculates a sequence of numbers depending on Total_weeks, and then loops through the sequence to calculate the weekly averages.

What I was thinking:


for (i in unique(ID)) {

  sequence[i] <- seq(from = 1, to = unique(Total_weeks[i]), by = 7) 
  
  for (ii in 1:length(sequence[i])) {
    
    week[ii] <- mean(O3[Date >= Date_start + first_number_in_sequence & Date < Date_start + next_number_in_sequence[i]])
    
  }
    
}

But I'm not sure how to write the code so that the weekly averages are calculated according to the correct number in the sequence. I also attempted to do this in dplyr so that I could group by the ID but am not sure how to create a different number of columns for each unique ID (since they have varying Total_weeks).

Any insight would be appreciated.

Upvotes: 1

Views: 302

Answers (1)

Ben
Ben

Reputation: 30474

Using tidyverse you can try the following.

First, group_by(ID) and then determine the week number of each row's date based on the start date.

Then, group_by both ID and the week number and calculate the mean O3 for each week. If you have missing data, you may consider na.rm = TRUE.

Finally, if you want columns for each week, and one row per ID, consider using pivot_wider.

Note that this ignores Date_end (assumes you are not interested in weeks with no data).

library(tidyverse)

df %>%
  group_by(ID) %>%
  mutate(week = floor(difftime(Date, Date_start, units = "weeks")) + 1) %>%
  group_by(ID, week) %>%
  summarise(weekly_mean = mean(O3)) %>%
  pivot_wider(id_cols = ID, names_from = "week", values_from = "weekly_mean", names_prefix = "week_")

I believe a data.table equivalent might be something like this:

library(data.table)

setDT(df)

dcast(df[, week := floor(difftime(Date, Date_start, units = "weeks")) + 1, by = .(ID)][,
  .(weekly_mean = mean(O3)), by = .(ID, week)], ID ~ paste("week", week, sep = "_"), value.var = "weekly_mean")

Output

  ID    week_1 week_2 week_3
  <chr>  <dbl>  <dbl>  <dbl>
1 a       18.6   27.8   19.9
2 b       21.8   24.8   23.1

Data

df <- structure(list(O3 = c(21.1, 27.3, 23.8, 29.5, 23.8, 27.1, 31.6, 
25.8, 31.2, 14, 19.1, 15.5, 15.6, 28.6, 16.9, 27.4, 30.1, 24.4, 
21.2, 22.1, 26.1, 19.9), Date = structure(c(3679, 3681, 3682, 
3683, 3680, 3685, 3686, 3687, 3684, 3689, 3673, 3675, 3677, 3678, 
3686, 3687, 3688, 3692, 3681, 3693, 3695, 3696), class = "Date"), 
    ID = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "a", 
    "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"
    ), Date_start = structure(c(3673, 3673, 3673, 3673, 3673, 
    3673, 3673, 3673, 3673, 3673, 3673, 3673, 3677, 3677, 3677, 
    3677, 3677, 3677, 3677, 3677, 3677, 3677), class = "Date"), 
    Date_end = structure(c(3690, 3690, 3690, 3690, 3690, 3690, 
    3690, 3690, 3690, 3690, 3690, 3690, 3696, 3696, 3696, 3696, 
    3696, 3696, 3696, 3696, 3696, 3696), class = "Date"), Total_weeks = c(2.428571, 
    2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 
    2.428571, 2.428571, 2.428571, 2.428571, 2.428571, 2.714286, 
    2.714286, 2.714286, 2.714286, 2.714286, 2.714286, 2.714286, 
    2.714286, 2.714286, 2.714286)), row.names = c(NA, -22L), class = "data.frame")

Upvotes: 1

Related Questions