Reputation: 400
I need your help in writing a efficient function to calculate a weighted difference. Preferrably with data.table as my dataset has a million records.
Here is a small df with two groups 1 and 2. It is ordered by layer from 1 to maximum 5 (it could also be less than 5 layers, e.g. group = 2 has only 3)
group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)
df <- data.frame(group, layer, value, reference_value, size, cumsum_size)
I need to calculate a weighted difference to reference_value column for different target sizes for every group. Say, I have the following
target_sizes <- c(20, 50, 200, 500)
To calculate the weighted diff the target_size needs to be compared to the cumsum_size to decide at which layer it will be, so, for df$group == 1 the calculation should be:
Case 1
target size 20
# it stays on layer 1 as 20 < 30
weighted impact is
abs(10.62-10.90)*20
Case 2
# target size 50
it takes 30 from layer 1 and additional 20 from layer 2
weighted impact is
abs(10.62 - 10.90)*30 + abs(10.61 - 10.90)*20
Case 3
# target size 200
it takes 30 from layer 1, 120 from layer 2 and 50 from layer 3
weighted impact is
abs(10.62 - 10.90)*30 + abs(10.61 - 10.90)*120 + abs(10.60 - 10.90)*50
Case 4
# target size 500
as the cumsum_size in the last layer 5 (460) less than 500
it should return NA
Upvotes: 0
Views: 96
Reputation: 25225
Here is one possible approach.
#since abs diff never changes, calculate it first
DT[, absDiff := abs(value - reference_value)]
#merge the unique set of group & cumsum_size with DT
dat <- merge(DT, DT[,CJ(isTarg=TRUE, group=unique(group), cumsum_size=target_sizes)],
by=c("group", "cumsum_size"), all=TRUE)
#backfill the absDiff using zoo::na.locf
dat[, absDiff := na.locf(absDiff, na.rm=FALSE, fromLast=TRUE), by=.(group)]
#calculate the desired results by group and filter for chosen target_sizes
dat[, result := cumsum(diff(c(0, cumsum_size)) * absDiff), by=.(group)][(isTarg)]
output:
group cumsum_size layer value reference_value size absDiff isTarg result
1: 1 20 NA NA NA NA 0.28 TRUE 5.6
2: 1 50 NA NA NA NA 0.29 TRUE 14.2
3: 1 200 NA NA NA NA 0.30 TRUE 58.2
4: 1 500 NA NA NA NA NA TRUE NA
5: 2 20 NA NA NA NA 0.30 TRUE 6.0
6: 2 50 NA NA NA NA 0.30 TRUE 15.0
7: 2 200 NA NA NA NA 0.35 TRUE 63.3
8: 2 500 NA NA NA NA NA TRUE NA
data and packages:
library(data.table)
library(zoo)
group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)
DT <- data.table(group, layer, value, reference_value, size, cumsum_size)
target_sizes <- c(20, 50, 200, 500)
Upvotes: 1