Reputation: 1500
The data:
Let X be an array of labeled integers, roughly in the range [0, 2000000]. The size |X| is around 3000 elements. The labels come from the set [A, B, C].
For example: [(13, A), (16, B), (32, A), (84, C), ...]
The constraints:
Every data point can be moved by ±50 on the axis, but the relative ordering of the points must be preserved.
The goal:
There are two metrics to optimize, with adjustable weights: the variance of the gaps between consecutive points (to be minimized, so the spacing is as uniform as possible) and the average gap size (to be maximized).
What I've tried:
I initially went for scipy.optimize.minimize(), but it turns out that it doesn't handle integer-only variables:
import numpy as np
from scipy.optimize import minimize

def stretch_optimization(initial_array: np.ndarray, weight_variance=1.0, weight_average=0.1):
    ms_data = initial_array[:, 0]

    def objective_function(ms_data):
        gaps = np.diff(ms_data)
        # Gap variance: we want the gaps to be as uniform as possible.
        gaps_variance = np.var(gaps)
        # Gap average: we want the gaps to be as large as possible.
        gaps_average = np.mean(gaps)
        return weight_variance * gaps_variance - weight_average * gaps_average

    bounds = [(ms - 50, ms + 50) for ms, _ in initial_array]

    # Ordering constraint: consecutive differences must remain non-negative.
    def order_constraint(ms):
        return np.diff(ms)

    constraints = [{'type': 'ineq', 'fun': order_constraint}]
    result = minimize(objective_function, ms_data, bounds=bounds, constraints=constraints)
    return result
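For instance, on a small self-contained toy example (10 random integers and a ±5 bound, values picked just for illustration), the continuous solver returns fractional positions, and naively rounding them can violate the ordering constraint:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.sort(rng.integers(0, 100, size=10)).astype(float)

def objective(ms, w_var=1.0, w_avg=0.1):
    gaps = np.diff(ms)
    return w_var * np.var(gaps) - w_avg * np.mean(gaps)

bounds = [(x - 5, x + 5) for x in X]
constraints = [{"type": "ineq", "fun": lambda ms: np.diff(ms)}]
res = minimize(objective, X, bounds=bounds, constraints=constraints)
print(res.x)  # fractional positions, not integers
```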
I'm not entirely sure where to go from here. I've read that maybe I should use scipy.optimize.milp, but it's not clear to me how to formulate the problem for it.
Upvotes: 1
Views: 65
Reputation: 195543
Maybe you can use the pulp module.
Example with an array of 20 variables (values from 0 to 100, shift constraint is ±5):
import pulp
import numpy as np
np.random.seed(42)
# X = np.sort(np.random.randint(0, 2_000_000, size=3000))
X = np.sort(np.random.randint(0, 100, size=20))
prob = pulp.LpProblem("stretch", pulp.LpMinimize)
# amounts to add to each variable (to be computed):
variables = [
    # pulp.LpVariable(f"x{i:04}", lowBound=-50, upBound=50, cat="Integer")
    pulp.LpVariable(f"x{i:04}", lowBound=-5, upBound=5, cat="Integer")
    for i, v in enumerate(X)
]
maximize_average_gap_size = pulp.LpVariable("maximize_average_gap_size")
minimize_variance = pulp.LpVariable("minimize_variance")
# our problem: minimize variance and maximize gap size
prob += (minimize_variance * 1.0) - (maximize_average_gap_size * 0.1)
gaps = [
    ((X[idx] + variables[idx]) - (X[idx - 1] + variables[idx - 1]))
    for idx in range(1, len(variables))
]
mean_gaps = pulp.lpSum(gaps) / len(gaps)
# define "maximize_average_gap_size"
prob += maximize_average_gap_size == mean_gaps
abs_vars = []
for i, g in enumerate(gaps):
    abs_var = pulp.LpVariable(f"abs_var_{i:04}")
    prob += (g - mean_gaps) <= abs_var
    prob += -(g - mean_gaps) <= abs_var
    abs_vars.append(abs_var)
# define "minimize_variance" (sum of absolute deviations from the mean, a linear proxy for variance)
prob += minimize_variance == pulp.lpSum(abs_vars)
# constraint: every value after shifting has to be greater than or equal to its predecessor
for idx in range(1, len(variables)):
    prob += (X[idx - 1] + variables[idx - 1]) <= (X[idx] + variables[idx])
prob.solve()
# print(pulp.LpStatus[prob.status])
# for v in prob.variables():
# print(v.name, "=", v.varValue)
print("BEFORE:")
print(X)
gaps_before = np.diff(X)
print(f"{np.var(gaps_before)=}")
print(f"{np.mean(gaps_before)=}")
print("AFTER:")
after = [int(x + v.value()) for x, v in zip(X, variables)]
print(after)
gaps_after = np.diff(after)
print(f"{np.var(gaps_after)=}")
print(f"{np.mean(gaps_after)=}")
Prints:
Welcome to the CBC MILP Solver
Version: 2.10.3
Build Date: Dec 15 2019
...
BEFORE:
[ 1 2 14 20 21 23 29 37 51 52 60 71 74 74 82 86 87 87 92 99]
np.var(gaps_before)=np.float64(17.185595567867033)
np.mean(gaps_before)=np.float64(5.157894736842105)
AFTER:
[2, 7, 12, 18, 23, 28, 34, 40, 46, 54, 60, 66, 69, 74, 77, 82, 87, 92, 97, 102]
np.var(gaps_after)=np.float64(1.1412742382271468)
np.mean(gaps_after)=np.float64(5.2631578947368425)
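As an aside, since the question mentions scipy.optimize.milp: the same linear model can be expressed there as well. A sketch, assuming the same toy data and the same |gap - mean| linearization as above. The key observation is that the sum of gaps telescopes, so the mean gap ((X[-1]+d[-1]) - (X[0]+d[0])) / (n-1) is itself linear in the shifts d:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

np.random.seed(42)
X = np.sort(np.random.randint(0, 100, size=20))
n = len(X)
m = n - 1                      # number of gaps

# Decision vector z = [d_0..d_{n-1}, t_0..t_{m-1}]:
# d_i is the integer shift of X[i], t_j bounds |gap_j - mean_gap|.
n_vars = n + m

# Objective: 1.0 * sum(t) - 0.1 * mean_gap; the constant part of
# mean_gap drops out, leaving only the d_0 and d_{n-1} terms.
c = np.zeros(n_vars)
c[n:] = 1.0
c[0] += 0.1 / m
c[n - 1] -= 0.1 / m

rows, lbs, ubs = [], [], []

# Ordering: gap_j = (X[j+1] - X[j]) + d_{j+1} - d_j >= 0
for j in range(m):
    row = np.zeros(n_vars)
    row[j + 1], row[j] = 1.0, -1.0
    rows.append(row)
    lbs.append(-(X[j + 1] - X[j]))
    ubs.append(np.inf)

mean_X = (X[-1] - X[0]) / m
for j in range(m):
    # gap_j - mean_gap as a linear expression const + row . z
    row = np.zeros(n_vars)
    row[j + 1] += 1.0
    row[j] -= 1.0                 # d_{j+1} - d_j
    row[n - 1] -= 1.0 / m
    row[0] += 1.0 / m             # - (d_{n-1} - d_0) / m
    const = (X[j + 1] - X[j]) - mean_X
    r1 = row.copy(); r1[n + j] = -1.0    #  (gap_j - mean) - t_j <= 0
    rows.append(r1); lbs.append(-np.inf); ubs.append(-const)
    r2 = -row;       r2[n + j] = -1.0    # -(gap_j - mean) - t_j <= 0
    rows.append(r2); lbs.append(-np.inf); ubs.append(const)

A = LinearConstraint(np.array(rows), lbs, ubs)
bounds = Bounds(lb=[-5.0] * n + [0.0] * m, ub=[5.0] * n + [np.inf] * m)
integrality = np.array([1] * n + [0] * m)  # shifts integer, t continuous

res = milp(c=c, constraints=A, bounds=bounds, integrality=integrality)
after = X + np.round(res.x[:n]).astype(int)
print(after)
print(np.var(np.diff(after)), np.mean(np.diff(after)))
```

The exact numbers may differ slightly from the CBC output above (different solver, and ties between equally good solutions), but the shifted array stays sorted and every shift stays within ±5.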
Upvotes: 1