Reputation: 15
I have two dataframes. One with relevant time horizons (in years) of observations and one with time series data of these observations. The time horizons differ for every observation. Moreover, data is not available for every year in the given time horizons. I want to calculate the mean of the time series data for every observation time horizon's available data. Additionally, I need to know from when on data is available for every observation.
So if these two data frames are given:
time_horizon <- data.frame(observation = c("A", "B", "C"),
start = c(2010, 2011, 2012),
end = c(2015, 2016, 2015)
)
data <- data.frame(Year = c(2010:2016),
A = c(NA, NA, 19485, 19485, 19456, 19454, 18475),
B = c(NA, NA, NA, 18762, 18567, 18321, 19001),
C = c(26374, 21134, 21212, 21111, 21344, 22222, 21345)
)
I need a data frame like this as a result:
calculated <- data.frame(observation = c("A", "B", "C"),
mean = c(19470, 18662.75, 21472.25),
data_start = c(2012, 2013, 2015),
data_end = c(2015, 2016, 2015)
)
I have tried several variations of colMeans(x, na.rm = TRUE) or apply(x,2,mean, na.rm = TRUE) but I could not come up with a solution that takes the varying time horizons into account. Also for the ex post indication of the available data time horizons I have no solution.
Any suggestions?
Upvotes: 1
Views: 45
Reputation: 30474
Here is one approach using fuzzyjoin
and tidyverse
. You can fuzzyjoin
your two tables together, where Year
falls between start
and end
. To match on observation
, you can put your data
into long form as well.
After removing missing NA
values, you can summarise
by observation
. The data_start
and data_end
years should be the min
and max
years left after removing missing observations.
library(fuzzyjoin)
library(tidyverse)
time_horizon %>%
fuzzy_inner_join(
data %>% pivot_longer(cols = -Year),
by = c("start" = "Year", "end" = "Year", "observation" = "name"),
match_fun = list(`<=`, `>=`, `==`)
) %>%
drop_na() %>%
group_by(observation) %>%
summarise(mean = mean(value),
data_start = min(Year),
data_end = max(Year))
Output
observation mean data_start data_end
<chr> <dbl> <int> <int>
1 A 19470 2012 2015
2 B 18663. 2013 2016
3 C 21472. 2012 2015
Upvotes: 1