Reputation: 15
I have a data frame composed of four columns: state A; state B; imports of state B from state A; and year. It has 594473 rows.
For each state B, want to calculate the total imports B with all possible partners in a given year and the 4 years before that.
To that end, I created the following for loop, where "j" is the country B in a given row, year is the year of that row, and z[,5] is the imports of country b from a on that year:
# create column for 5 year total imports of b
for (row in 1:nrow(z)){
j <-z[row,3]
year<-as.numeric(paste(z[row,1]))
z[row,6]<- sum(z[z[,3]==j & z[,1]==year,5])+
sum(z[z[,3]==j & z[,1]==year-1,5])+
sum(z[z[,3]==j & z[,1]==year-2,5])+
sum(z[z[,3]==j & z[,1]==year-3,5])+
sum(z[z[,3]==j & z[,1]==year-4,5])
}
colnames(z)[6]<-"5year_imp_b
I works, but takes ages, given the size of the data. My computer can't finish it. Is there a faster way to do that?
Sorry if that is not a good question, I'm new to R and programming in general.
Many thanks!
Upvotes: 1
Views: 39
Reputation: 14764
Here is a function that should be relatively quick (you will need to import data.table
for it to work):
calculate_rolling_sum <- function(df, date_col, calc_col, id_var, k) {
return(setDT(df)[order(get(date_col)),][, paste(calc_col, "roll_sum", k, sep = "_") := sapply(get(date_col), function(x) sum(get(calc_col)[between(get(date_col), x - k, x)])),
by = mget(id_var)])
}
Example dataframe:
df <- data.frame(
state_A = c(rep("x", 6), rep("y", 4), rep("z", 6)),
state_B = c(rep("d", 16)),
imports_AB = c(rep(3, 3), rep(4, 4), rep(5, 2), rep(6, 2), rep(9, 3), rep(3, 2)),
yr = c(seq(2000, 2006, 1), seq(2009, 2017, 1))
)
state_A state_B imports_AB yr
1: x d 3 2000
2: x d 3 2001
3: x d 3 2002
4: x d 4 2003
5: x d 4 2004
6: x d 4 2005
7: y d 4 2006
8: y d 5 2009
9: y d 5 2010
10: y d 6 2011
11: z d 6 2012
12: z d 9 2013
13: z d 9 2014
14: z d 9 2015
15: z d 3 2016
16: z d 3 2017
Applying the function for current and the past 3 years and the new dataframe:
library(data.table)
df_rolling <- calculate_rolling_sum(df, date_col = "yr", calc_col = "imports_AB", id_var = c("state_A", "state_B"), k = 3)
df_rolling[]
state_A state_B imports_AB yr imports_AB_roll_sum_3
1: x d 3 2000 3
2: x d 3 2001 6
3: x d 3 2002 9
4: x d 4 2003 13
5: x d 4 2004 14
6: x d 4 2005 15
7: y d 4 2006 4
8: y d 5 2009 9
9: y d 5 2010 10
10: y d 6 2011 16
11: z d 6 2012 6
12: z d 9 2013 15
13: z d 9 2014 24
14: z d 9 2015 33
15: z d 3 2016 30
16: z d 3 2017 24
What is the advantage of this function over standard rolling functions? For instance, in 2010 it won't take into account 2006 anymore, since this is not the requirement.
Usual rolling functions that count only by row indices would count it (as it is 2 rows below).
In this way, you don't need to care whether you have gap between years, and there is no need to complete the dataset.
Upvotes: 1