Reputation: 127
I wonder if this loop can be speeded up, and I wonder if you could help me with this wondering.
I've used all my tricks but I still think it has not been enough.
The problem is about substracting amounts from one year df from the next year df. I have 2 data frames:
df_2016 <- data.frame(
subject = rep(1:195, 65*39),
items = rep(1:39, 195*65),
sub_items = rep(1:65, 195*39),
value = sample(1:100000000,(195*65*39)),
period = rep("2016",(195*65*39)))
df_2016 <- df_2016[sample(1:(195*65*39),450000),] # See Reference "A" below
df_2017 <- data.frame(
subject = rep(1:195, 65*39),
items = rep(1:39, 195*65),
sub_items = rep(1:65, 195*39),
value = sample(1:100000000,(195*65*39)),
period = rep("2017",(195*65*39)))
Briefly, there are 3 categorical variables and 1 numeric. One dataset has amounts from 2016, and the other from 2017.
I want to substract amounts of 2016 from amounts of 2017, of the same "subject" AND same "items" AND same "sub_items". (Reference A) There is no row duplicated, but it could happen that one row from 2017 has no pair from 2016; if it has its pair, it has only one.
I made this function with a "cascade subseting" than it has really speeded up my function but not enough:
Func_diff <- function (df_per = df_2017, df_it = df_2016){
func_df <<- df_per[1,1:5] # i create the df where i'm going to put the outputs
y <- 1
subject_v <- sort(unique(df_per$subject))
# 0 # Loop over subject
for (j in 1:length(subject_v)) {
df_per_w <- df_per[df_per$subject == subject_v[j], ]
df_it_w <- df_it[df_it$subject == subject_v[j], ]
item_v <- sort(unique(df_per_w$items))
# 2 # . Loop1 over items
for (w in 1:length(item_v)){
sub_item_v <- sort(unique(df_per_w[df_per_w$items == item_v[w], 3]))
# 3 # Loop over subitems
for(z in 1:length(sub_item_v)){
dfcara_per <- df_per_w[df_per_w$items == item_v[w] & df_per_w$sub_items == sub_item_v[z],]
dfcara_it <- df_it_w[df_it_w$items == item_v[w] & df_it_w$sub_items == sub_item_v[z],]
# 4 # Loop over selected rows: subject[j], items[w], sub_items[z]
for (i in 1:nrow(dfcara_per)) {
# Checks if that combination of subject, item and subitem, existed the year before:
if(length(dfcara_it[dfcara_it$subject == dfcara_per[i,1] &
dfcara_it$items == dfcara_per[i,2] &
dfcara_it$sub_items == dfcara_per[i,3], 4]) != 0) {
func_df[y,1:5] <<- c(
dfcara_per[i,1:3],
sum(dfcara_per[i,4] -
dfcara_it[
dfcara_it$subject == dfcara_per[i,1] &
dfcara_it$items == dfcara_per[i,2] &
dfcara_it$sub_items == dfcara_per[i,3],
4]),
dfcara_per[i,5]
)
}else{
func_df[y,1:5] <<- func_df[i,1:5] # If there is no data in 2016, the function saves the amount of 2017
}
y <- y + 1
}
}
}
}
} # Function . End
It works but it takes soooo long, and i can't understand why an easy operation takes so long.
Thank you in advance!
Upvotes: 0
Views: 51
Reputation: 6191
Your problem is in the generation of your data. The construction of subject
, items
and sub_items
does not result in unique combinations of subject
, items
, and sub_items
so that violates your assumption that
There is no row duplicated,
I've tried to create another version of your data that has unique combinations of subject
, items
, and sub_items
using the expand.grid()
function.
DF2016 <- data.frame(expand.grid(1:195,1:65,1:39),
value = sample(1:100000000,(195*65*39)),
period = rep("2016",(195*65*39)))
DF2016 <- DF2016[sample(1:(195*65*39),450000),]
DF2017 <- data.frame(expand.grid(1:195,1:65,1:39),
value = sample(1:100000000,(195*65*39)),
period = rep("2017",(195*65*39)))
The variables Var1
, Var2
, and Var3
correspond to subject
, items
, and sub_items
.
You can then generate a left join to merge the two dataset
library("dplyr")
res <- DF2017 %>% left_join(DF2016, by=c("Var1", "Var2", "Var3")) %>%
mutate(difference = value.x-value.y)
The 2017 values are denoted value.x
and the values for 2016 are value.y
. This yields
> head(res)
Var1 Var2 Var3 value.x period.x value.y period.y difference
1 1 1 1 94920289 2017 84618631 2016 10301658
2 2 1 1 31008444 2017 87524572 2016 -56516128
3 3 1 1 44687050 2017 NA <NA> NA
4 4 1 1 87458715 2017 83105988 2016 4352727
5 5 1 1 40977802 2017 22528409 2016 18449393
6 6 1 1 80460053 2017 NA <NA> NA
Upvotes: 2