jf65a
jf65a

Reputation: 15

How do I calculate means of time series data with different time horizons and NAs in R

I have two dataframes. One with relevant time horizons (in years) of observations and one with time series data of these observations. The time horizons differ for every observation. Moreover, data is not available for every year in the given time horizons. I want to calculate the mean of the time series data for every observation time horizon's available data. Additionally, I need to know from when on data is available for every observation.

So if these two data frames are given:


time_horizon <- data.frame(observation = c("A", "B", "C"), 
                           start = c(2010, 2011, 2012), 
                           end = c(2015, 2016, 2015)
                           )

data <- data.frame(Year = c(2010:2016), 
                   A = c(NA, NA, 19485, 19485, 19456, 19454, 18475), 
                   B = c(NA, NA, NA, 18762, 18567, 18321, 19001), 
                   C = c(26374, 21134, 21212, 21111, 21344, 22222, 21345)
                   )

I need a data frame like this as a result:

calculated <- data.frame(observation = c("A", "B", "C"),
                         mean = c(19470, 18662.75, 21472.25),
                         data_start = c(2012, 2013, 2015),
                         data_end = c(2015, 2016, 2015)
                         )

I have tried several variations of colMeans(x, na.rm = TRUE) or apply(x,2,mean, na.rm = TRUE) but I could not come up with a solution that takes the varying time horizons into account. Also for the ex post indication of the available data time horizons I have no solution.

Any suggestions?

Upvotes: 1

Views: 45

Answers (1)

Ben
Ben

Reputation: 30474

Here is one approach using fuzzyjoin and tidyverse. You can fuzzyjoin your two tables together, where Year falls between start and end. To match on observation, you can put your data into long form as well.

After removing missing NA values, you can summarise by observation. The data_start and data_end years should be the min and max years left after removing missing observations.

library(fuzzyjoin)
library(tidyverse)

time_horizon %>%
  fuzzy_inner_join(
    data %>% pivot_longer(cols = -Year),
    by = c("start" = "Year", "end" = "Year", "observation" = "name"),
    match_fun = list(`<=`, `>=`, `==`)
  ) %>%
  drop_na() %>%
  group_by(observation) %>%
  summarise(mean = mean(value),
            data_start = min(Year),
            data_end = max(Year))

Output

  observation   mean data_start data_end
  <chr>        <dbl>      <int>    <int>
1 A           19470        2012     2015
2 B           18663.       2013     2016
3 C           21472.       2012     2015

Upvotes: 1

Related Questions