Mat
Mat

Reputation: 1500

How to optimally stretch and uniformize 1-dimensional integer points?

The data:

Let X be an array of labeled integers roughly in the domain [0, 2000000]. The size |X| is around 3000 elements. The labels are in the domain [A, B, C].

For example: [(13, A), (16, B), (32, A), (84, C), ...]


The constraints:

Every data point can be moved by ±50 on the axis, but the final ordering must respect the following criteria:


The goal:

There are 2 metrics to optimize with variable weight:

  1. The variance of the integers, to minimize. (Weighted strongly, say 1.0)
  2. The average distance of the integers, to maximize. (Weighted lightly, say 0.1)

What I've tried:

I initially went for scipy.optimize.minimize() but turns out that doesn't work on integer-only:

def stretch_optimization(initial_array: np.ndarray, weight_variance=1.0, weight_average=0.1):
    ms_data = initial_array[:, 0]
    
    def objective_function(ms_data):
        gaps = np.diff(ms_data)
    
        # Get gaps variance, we want the gaps to be as uniform as possible.
        gaps_variance = np.var(gaps)
    
        # Get gaps average, we want to maximize the gaps size.
        gaps_average = np.mean(gaps)
    
        return weight_variance * gaps_variance - weight_average * gaps_average
    
    bounds = [(ms - 50, ms + 50) for ms, _ in initial_array]
    
    def order_constraint(ms):
        return np.diff(ms)
    
    constraints = [{'type': 'ineq', 'fun': order_constraint}]
    
    result: np.ndarray = minimize(objective_function, ms_data, bounds=bounds, constraints=constraints)

I'm not entirely sure where to go from there. I read that maybe I should use scipy.optimize.milp (?) but it's not clear to me what I should do with that.

Upvotes: 1

Views: 65

Answers (1)

Andrej Kesely
Andrej Kesely

Reputation: 195543

Maybe you can use module:

Example with array of 20 variables (from 0 to 100, contstraint is +/-5):

import pulp
import numpy as np

np.random.seed(42)
# X = np.sort(np.random.randint(0, 2_000_000, size=3000))
X = np.sort(np.random.randint(0, 100, size=20))

prob = pulp.LpProblem("stretch", pulp.LpMinimize)

# amounts to add to each variable (to be computed):
variables = [
    # pulp.LpVariable(f"x{i:04}", lowBound=-50, upBound=50, cat="Integer")
    pulp.LpVariable(f"x{i:04}", lowBound=-5, upBound=5, cat="Integer")
    for i, v in enumerate(X)
]

maximize_average_gap_size = pulp.LpVariable("maximize_average_gap_size")
minimize_variance = pulp.LpVariable("minimize_variance")

# out problem: miniminize variance and maximize gap size
prob += (minimize_variance * 1.0) - (maximize_average_gap_size * 0.1)

gaps = [
    ((X[idx] + variables[idx]) - (X[idx - 1] + variables[idx - 1]))
    for idx in range(1, len(variables))
]
mean_gaps = pulp.lpSum(gaps) / len(gaps)

# define "maximize_average_gap_size"
prob += maximize_average_gap_size == mean_gaps

abs_minimize_variance = pulp.LpVariable("minimize_variance")

abs_vars = []
for i, g in enumerate(gaps):
    abs_var = pulp.LpVariable(f"abs_var_{i:04}")
    prob += (g - mean_gaps) <= abs_var
    prob += -(g - mean_gaps) <= abs_var
    abs_vars.append(abs_var)

# define "minimize_variance" (as sum of absolute values)
prob += minimize_variance == pulp.lpSum(abs_vars)

# constraint: every next value after adding has to be greater or equal than before
for idx in range(1, len(variables)):
    prob += (X[idx - 1] + variables[idx - 1]) <= (X[idx] + variables[idx])

prob.solve()

# print(pulp.LpStatus[prob.status])
# for v in prob.variables():
#     print(v.name, "=", v.varValue)

print("BEFORE:")
print(X)

gaps_before = np.diff(X)
print(f"{np.var(gaps_before)=}")
print(f"{np.mean(gaps_before)=}")

print("AFTER:")
after = [int(x + v.value()) for x, v in zip(X, variables)]
print(after)
gaps_after = np.diff(after)
print(f"{np.var(gaps_after)=}")
print(f"{np.mean(gaps_after)=}")

Prints:

Welcome to the CBC MILP Solver                                                                                                                                                                
Version: 2.10.3                                                                                                                                                                               
Build Date: Dec 15 2019 


... 

BEFORE:                                                                                        
[ 1  2 14 20 21 23 29 37 51 52 60 71 74 74 82 86 87 87 92 99]
np.var(gaps_before)=np.float64(17.185595567867033)                                                                                                                                            
np.mean(gaps_before)=np.float64(5.157894736842105)                                                                                                                                            
AFTER:                                                                                         
[2, 7, 12, 18, 23, 28, 34, 40, 46, 54, 60, 66, 69, 74, 77, 82, 87, 92, 97, 102]                                                                                                               
np.var(gaps_after)=np.float64(1.1412742382271468)      
np.mean(gaps_after)=np.float64(5.2631578947368425)                                        

Upvotes: 1

Related Questions