Lucas Paes
Lucas Paes

Reputation: 15

Is there a faster way of processing sum by more the one level of a factor in R?

I have a data frame composed of four columns: state A; state B; imports of state B from state A; and year. It has 594473 rows.

For each state B, want to calculate the total imports B with all possible partners in a given year and the 4 years before that.

To that end, I created the following for loop, where "j" is the country B in a given row, year is the year of that row, and z[,5] is the imports of country b from a on that year:

# create column for 5 year total imports of b



for (row in 1:nrow(z)){
  j   <-z[row,3]
  year<-as.numeric(paste(z[row,1]))
  z[row,6]<- sum(z[z[,3]==j & z[,1]==year,5])+  
    sum(z[z[,3]==j & z[,1]==year-1,5])+
    sum(z[z[,3]==j & z[,1]==year-2,5])+
    sum(z[z[,3]==j & z[,1]==year-3,5])+
    sum(z[z[,3]==j & z[,1]==year-4,5])
}

colnames(z)[6]<-"5year_imp_b

I works, but takes ages, given the size of the data. My computer can't finish it. Is there a faster way to do that?

Sorry if that is not a good question, I'm new to R and programming in general.

Many thanks!

Upvotes: 1

Views: 39

Answers (1)

arg0naut91
arg0naut91

Reputation: 14764

Here is a function that should be relatively quick (you will need to import data.table for it to work):

calculate_rolling_sum <- function(df, date_col, calc_col, id_var, k) {

  return(setDT(df)[order(get(date_col)),][, paste(calc_col, "roll_sum", k, sep = "_") := sapply(get(date_col), function(x) sum(get(calc_col)[between(get(date_col), x - k, x)])),
                                                                 by = mget(id_var)])

}

Example dataframe:

df <- data.frame(
  state_A = c(rep("x", 6), rep("y", 4), rep("z", 6)),
  state_B = c(rep("d", 16)),
  imports_AB = c(rep(3, 3), rep(4, 4), rep(5, 2), rep(6, 2), rep(9, 3), rep(3, 2)),
  yr = c(seq(2000, 2006, 1), seq(2009, 2017, 1))
)

    state_A state_B imports_AB   yr
 1:       x       d          3 2000
 2:       x       d          3 2001
 3:       x       d          3 2002
 4:       x       d          4 2003
 5:       x       d          4 2004
 6:       x       d          4 2005
 7:       y       d          4 2006
 8:       y       d          5 2009
 9:       y       d          5 2010
10:       y       d          6 2011
11:       z       d          6 2012
12:       z       d          9 2013
13:       z       d          9 2014
14:       z       d          9 2015
15:       z       d          3 2016
16:       z       d          3 2017

Applying the function for current and the past 3 years and the new dataframe:

library(data.table)

df_rolling <- calculate_rolling_sum(df, date_col = "yr", calc_col = "imports_AB", id_var = c("state_A", "state_B"), k = 3)

df_rolling[]

    state_A state_B imports_AB   yr imports_AB_roll_sum_3
 1:       x       d          3 2000                     3
 2:       x       d          3 2001                     6
 3:       x       d          3 2002                     9
 4:       x       d          4 2003                    13
 5:       x       d          4 2004                    14
 6:       x       d          4 2005                    15
 7:       y       d          4 2006                     4
 8:       y       d          5 2009                     9
 9:       y       d          5 2010                    10
10:       y       d          6 2011                    16
11:       z       d          6 2012                     6
12:       z       d          9 2013                    15
13:       z       d          9 2014                    24
14:       z       d          9 2015                    33
15:       z       d          3 2016                    30
16:       z       d          3 2017                    24

What is the advantage of this function over standard rolling functions? For instance, in 2010 it won't take into account 2006 anymore, since this is not the requirement.

Usual rolling functions that count only by row indices would count it (as it is 2 rows below).

In this way, you don't need to care whether you have gap between years, and there is no need to complete the dataset.

Upvotes: 1

Related Questions