David_Rowie
David_Rowie

Reputation: 127

R - Speeding up loop

I wonder if this loop can be speeded up, and I wonder if you could help me with this wondering.

I've used all my tricks but I still think it has not been enough.

The problem is about substracting amounts from one year df from the next year df. I have 2 data frames:

df_2016 <- data.frame(
  subject = rep(1:195, 65*39),
  items = rep(1:39, 195*65),
  sub_items = rep(1:65, 195*39),
  value = sample(1:100000000,(195*65*39)),
  period = rep("2016",(195*65*39)))

   df_2016 <- df_2016[sample(1:(195*65*39),450000),] # See Reference "A" below

df_2017 <- data.frame(
  subject = rep(1:195, 65*39),
  items = rep(1:39, 195*65),
  sub_items = rep(1:65, 195*39),
  value = sample(1:100000000,(195*65*39)),
  period = rep("2017",(195*65*39)))

Briefly, there are 3 categorical variables and 1 numeric. One dataset has amounts from 2016, and the other from 2017.

I want to substract amounts of 2016 from amounts of 2017, of the same "subject" AND same "items" AND same "sub_items". (Reference A) There is no row duplicated, but it could happen that one row from 2017 has no pair from 2016; if it has its pair, it has only one.

I made this function with a "cascade subseting" than it has really speeded up my function but not enough:

Func_diff <- function (df_per = df_2017, df_it = df_2016){

  func_df <<- df_per[1,1:5]   # i create the df where i'm going to put the outputs

  y <- 1

  subject_v <- sort(unique(df_per$subject))


  # 0 # Loop over subject

  for (j in 1:length(subject_v)) {

    df_per_w <- df_per[df_per$subject == subject_v[j], ]
    df_it_w <- df_it[df_it$subject == subject_v[j], ]

    item_v <- sort(unique(df_per_w$items))


  # 2 #  . Loop1 over items

  for (w in 1:length(item_v)){

      sub_item_v <- sort(unique(df_per_w[df_per_w$items == item_v[w], 3]))


  # 3 # Loop over subitems

  for(z in 1:length(sub_item_v)){

      dfcara_per <- df_per_w[df_per_w$items == item_v[w] & df_per_w$sub_items == sub_item_v[z],]
      dfcara_it <- df_it_w[df_it_w$items == item_v[w] & df_it_w$sub_items == sub_item_v[z],]


  # 4 # Loop over selected rows: subject[j], items[w], sub_items[z]

  for (i in 1:nrow(dfcara_per)) {

          # Checks if that combination of subject, item and subitem, existed the year before:

  if(length(dfcara_it[dfcara_it$subject == dfcara_per[i,1] &
                              dfcara_it$items == dfcara_per[i,2] &
                              dfcara_it$sub_items == dfcara_per[i,3], 4]) != 0) {

  func_df[y,1:5] <<- c(

              dfcara_per[i,1:3],

              sum(dfcara_per[i,4] -
                    dfcara_it[
                      dfcara_it$subject == dfcara_per[i,1] &
                      dfcara_it$items == dfcara_per[i,2] &
                      dfcara_it$sub_items == dfcara_per[i,3], 
                              4]),

              dfcara_per[i,5]
                       )

          }else{

            func_df[y,1:5] <<- func_df[i,1:5] # If there is no data in 2016, the function saves the amount of 2017

          }

          y <- y + 1  

        } 
      }
    }
  }

} # Function . End

It works but it takes soooo long, and i can't understand why an easy operation takes so long.

Thank you in advance!

Upvotes: 0

Views: 51

Answers (1)

ekstroem
ekstroem

Reputation: 6191

Your problem is in the generation of your data. The construction of subject, items and sub_items does not result in unique combinations of subject, items, and sub_items so that violates your assumption that

There is no row duplicated,

I've tried to create another version of your data that has unique combinations of subject, items, and sub_items using the expand.grid() function.

DF2016 <- data.frame(expand.grid(1:195,1:65,1:39), 
                     value = sample(1:100000000,(195*65*39)),
                     period = rep("2016",(195*65*39)))

DF2016 <- DF2016[sample(1:(195*65*39),450000),]

DF2017 <- data.frame(expand.grid(1:195,1:65,1:39), 
                     value = sample(1:100000000,(195*65*39)),
                     period = rep("2017",(195*65*39)))

The variables Var1, Var2, and Var3 correspond to subject, items, and sub_items.

You can then generate a left join to merge the two dataset

library("dplyr")
res <- DF2017 %>% left_join(DF2016, by=c("Var1", "Var2", "Var3")) %>% 
       mutate(difference = value.x-value.y)

The 2017 values are denoted value.x and the values for 2016 are value.y. This yields

> head(res)
  Var1 Var2 Var3  value.x period.x  value.y period.y difference
1    1    1    1 94920289     2017 84618631     2016   10301658
2    2    1    1 31008444     2017 87524572     2016  -56516128
3    3    1    1 44687050     2017       NA     <NA>         NA
4    4    1    1 87458715     2017 83105988     2016    4352727
5    5    1    1 40977802     2017 22528409     2016   18449393
6    6    1    1 80460053     2017       NA     <NA>         NA

Upvotes: 2

Related Questions