Reputation: 35
I have the following dataset which has the number of A, B and C both in 2000 and 2005. I now need to inpolate the dataframe subject to the constraint that the sum of A, B and C must be consistent with the Total amount, which is available for every year. Since I am dealing with population data, the values have to be integers.
data <- data.frame( Year = c(2000, 2001, 2002, 2003, 2004, 2005),
Total = c(50, 52, 53, 57, 60, 61),
A = c(12, NA, NA, NA, NA, 17),
B = c(22, NA, NA, NA, NA, 24),
C = c(16, NA, NA, NA, NA, 20) )
So far, I tried using na.approx()
from the zoo
package. But I was unable to make the "Total"-constraint work.
Thank you very much for your help.
Upvotes: 2
Views: 165
Reputation: 270045
1) We can solve this without rounding (as rounding may not guarantee that the constraints continue to hold) by defining the following quadratic integer programming problem with linear constraints. Define a least squares optimization problem based on this pseudo-code. The diff constraint ensures that each column is non-decreasing and the last constraint ensures that the first and last rows are not changed. (A variation of would be to use the sum of the absolute differences in the objective instead of the squared differences.)
ABC <- na.approx(data[-(1:2)])
Minimize sum((ABC - X)^2) over X
s.t. X is integer matrix and
rowSums(X) == data$Total and
diff(X) >= 0 and
X[c(1, nr), ] == ABC[c(1, nr), ]
Function opt
below implements this optimization. It has arguments ABC
which is the A, B and C columns of na.approx(data[-(1:2)])
and Total
which is the Total column of data
. Using CVXR (since the problem is convex) define the X
matrix to be solved for, the objective
, the constraints
and the problem
(which is the objective plus the constraints) and then use CVXR's solve
to solve that problem. If you prefer to use absolute differences rather than squared differences in the objective replace (ABC - X)^2
below with abs(ABC - X)
.
library(CVXR)
library(zoo)
opt <- function(ABC, Total) {
nr <- nrow(ABC); nc <- ncol(ABC)
X <- Variable(rows = nr, cols = nc, integer = TRUE)
objective <- Minimize(sum((ABC - X)^2))
constraints <- list(sum_entries(X, 1) == Total, diff(X) >= 0,
X[c(1, nr), ] == ABC[c(1, nr), ])
problem <- Problem(objective, constraints)
res <- solve(problem)
print(res$status)
res$getValue(X)
}
ABC <- na.approx(data[-(1:2)])
replace(data, colnames(ABC), opt(ABC, data$Total))
giving
[1] "optimal"
Year Total A B C
1 2000 50 12 22 16
2 2001 52 13 22 17
3 2002 53 14 22 17
4 2003 57 15 23 19
5 2004 60 16 24 20
6 2005 61 17 24 20
2) We could combine the other two answers with some elements of this one to get this approach. Unlike (1) we are not guaranteed a solution even if a feasible solution exists since the approach does not handle the diff
constraint other than to check afterwards whether it happened to be satisfied.
library(sfsmisc)
library(zoo)
opt2 <- function(ABC, Total, method = "offset-round") {
rowRoundfixS <- function(x) t(apply(x, 1, roundfixS, method = method))
ABC2 <- rowRoundfixS(proportions(ABC, 1) * Total)
nr <- nrow(ABC)
ok <- all(rowSums(ABC2) == Total) && all(diff(ABC2) >= 0) &&
all(ABC2[c(1, nr), ] == ABC[c(1, nr), colnames(ABC)])
print(if (ok) "ok" else "failed")
ABC2
}
ABC <- na.approx(data[-(1:2)])
replace(data, colnames(ABC), opt(ABC, data$Total))
giving
[1] ok
Year Total A B C
1 2000 50 12 22 16
2 2001 52 13 22 17
3 2002 53 14 22 17
4 2003 57 15 23 19
5 2004 60 16 24 20
6 2005 61 17 24 20
which at least for this data gives the same result as (1).
Input
data <- data.frame(
Year = seq(2000, 2005, by = 1),
Total = c(50, 52, 53, 57, 60, 61),
A = c(12, NA, NA, NA, NA, 17),
B = c(22, NA, NA, NA, NA, 24),
C = c(16, NA, NA, NA, NA, 20)
)
Upvotes: 2
Reputation: 19339
You could interpolate each one individually, as you did using zoo::na.approx
, and then use iterative proportional fitting (library ipfr) followed by optimized rounding using the roundfixS
function from the sfsmisc package.
Interpolation
nr <- nrow(data) # here it is 6, but may differ for the real data
mtx <- matrix(data = sapply(data[,3:5], na.approx),
nrow = nr, ncol = 3)[-c(1, nr),]
addmargins(mtx)
Sum
13 22.4 16.8 52.2
14 22.8 17.6 54.4
15 23.2 18.4 56.6
16 23.6 19.2 58.8
Sum 58 92.0 72.0 222.0
Note the row sums don't agree with your constraint.
Iterative proportional fitting
library(ipfr)
row_targets <- data$Total[2:5]
column_targets <- colSums(mtx)
X <- ipu_matrix(mtx, row_targets, column_targets)
addmargins(X)
Sum
12.94506 22.32052 16.73442 52
13.63439 22.21969 17.14592 53
15.10020 23.37088 18.52891 57
16.32035 24.08890 19.59075 60
Sum 58.00000 92.00000 72.00000 222
Row totals now agree with your constraints, but the numbers inside are not integers.
Optimized rounding
library(sfsmisc)
X_round <- roundfixS(X)
addmargins(X_round)
Sum
13 22 17 52
14 22 17 53
15 23 19 57
16 24 20 60
Sum 58 91 73 222
data <- data.frame( Year = c(2000, 2001, 2002, 2003, 2004, 2005),
Total = c(50, 52, 53, 57, 60, 61),
A = c(12, NA, NA, NA, NA, 17),
B = c(22, NA, NA, NA, NA, 24),
C = c(16, NA, NA, NA, NA, 20) )
Upvotes: 3
Reputation: 73562
You could first linearly interpolate columns using approx
, then use proportions(., margin=1)
, multiply with the totals and round
to integers.
> (proportions(sapply(lapply(data[, 3:5], approx, xout=seq_len(nrow(data))),
+ `[[`, "y"), margin=1)*data$Total) |>
+ round() |>
+ `mode<-`('integer') ## optional
A B C
[1,] 12 22 16
[2,] 13 22 17
[3,] 14 22 17
[4,] 15 23 19
[5,] 16 24 20
[6,] 17 24 20
Upvotes: 4