Reputation: 1
I have several individual csv-files on specific country pairs and their trade volumes for the years 1870-2020 (using the COW trade dataset, smoothtotrade variable here). Unfortunately, the dataset is only available until 2014, so all other values are set NA.
After trying several things to impute/forecast the missing data, I've decided it might be best to just carry forward the last available value (i.e., smoothtotrade in 2014). However, I can't get it to work. I've been using the imputeTS package here, using the na_locf function. Can someone help me out?
The list of data frames is called data_frames. My current code:
library(imputeTS)
*Imputation function using carry forward of the average of the last three non-missing values*
impute_smoothtotrade <- function(ts_data) {
ts_data_imputed <- na.locf(ts_data, option = "locf")
return(ts_data_imputed)
}
*Loop through each data frame (time series) in the list*
for (i in seq_along(data_frames)) {
data_frames[[i]]$smoothtotrade <- impute_smoothtotrade(data_frames[[i]]$smoothtotrade)
}
this is the result of a random country pair, showing clearly that the 2014 value was evidently not carried forward as intended.
51 AUT CMR 2010 11.484859
52 AUT CMR 2011 10.393110
53 AUT CMR 2012 6.902980
54 AUT CMR 2013 4.058900
55 AUT CMR 2014 9.018300
89 AUT CMR 2015 2.582298
90 AUT CMR 2016 2.582298
91 AUT CMR 2017 2.582298
92 AUT CMR 2018 2.582298
93 AUT CMR 2019 2.582298
94 AUT CMR 2020 2.582298
Upvotes: 0
Views: 53
Reputation: 7205
Two (of many) options:
Sample data
# Sample dataframes and data_frame list
df1 <- data.frame(country = c(rep("AAA", 11)), year = 2010:2020,
smoothtotrade = c(11.484859, 10.393110, 6.902980, 4.058900, 9.018300, rep(NA, 6)))
df2 <- data.frame(country = c(rep("BBB", 11)), year = 2010:2020,
smoothtotrade = c(12.484859, 1.393110, 3.902980, 8.058900, 5.018300, rep(NA, 6)))
df3 <- data.frame(country = c(rep("CCC", 11)), year = 2010:2020,
smoothtotrade = c(8.484859, 9.393110, 10.902980, 9.058900, 8.018300, rep(NA, 6)))
data_frames <- list(df1, df2, df3)
Option 1: Using the dplyr
and tidyr
packages
library(dplyr)
library(tidyr)
# Single df with all dataframes
df4 <- bind_rows(data_frames, .id = "column_label")
result <- df4 %>%
group_by(country) %>%
fill(smoothtotrade, .direction = c("down")) %>%
ungroup()
result
# A tibble: 33 × 4
column_label country year smoothtotrade
<chr> <chr> <int> <dbl>
1 1 AAA 2010 11.5
2 1 AAA 2011 10.4
3 1 AAA 2012 6.90
4 1 AAA 2013 4.06
5 1 AAA 2014 9.02
6 1 AAA 2015 9.02
7 1 AAA 2016 9.02
8 1 AAA 2017 9.02
9 1 AAA 2018 9.02
10 1 AAA 2019 9.02
# ℹ 23 more rows
# ℹ Use `print(n = ...)` to see more rows
Option 2: Using your original method
for (i in seq_along(data_frames)) {
data_frames[[i]]$smoothtotrade <-
ifelse(is.na(data_frames[[i]]$smoothtotrade),
data_frames[[i]]$smoothtotrade[max(which(!is.na(data_frames[[i]]$smoothtotrade)))],
data_frames[[i]]$smoothtotrade)
}
data_frames
[[1]]
country year smoothtotrade
1 AAA 2010 11.48486
2 AAA 2011 10.39311
3 AAA 2012 6.90298
4 AAA 2013 4.05890
5 AAA 2014 9.01830
6 AAA 2015 9.01830
7 AAA 2016 9.01830
8 AAA 2017 9.01830
9 AAA 2018 9.01830
10 AAA 2019 9.01830
11 AAA 2020 9.01830
[[2]]
country year smoothtotrade
1 BBB 2010 12.48486
2 BBB 2011 1.39311
3 BBB 2012 3.90298
4 BBB 2013 8.05890
5 BBB 2014 5.01830
6 BBB 2015 5.01830
7 BBB 2016 5.01830
8 BBB 2017 5.01830
9 BBB 2018 5.01830
10 BBB 2019 5.01830
11 BBB 2020 5.01830
[[3]]
country year smoothtotrade
1 CCC 2010 8.484859
2 CCC 2011 9.393110
3 CCC 2012 10.902980
4 CCC 2013 9.058900
5 CCC 2014 8.018300
6 CCC 2015 8.018300
7 CCC 2016 8.018300
8 CCC 2017 8.018300
9 CCC 2018 8.018300
10 CCC 2019 8.018300
11 CCC 2020 8.018300
Upvotes: 0