Reputation: 30031
I've got a collection of orders.
[a, b]
[a, b, c]
[a, b, c, d]
[a, b, c, d]
[b, c]
[c, d]
Where a, b, c and d are SKUs, and there are big boxes full of them. And there are thousands of orders and hundreds of possible SKUs.
Now imagine that when packing these orders, if an order lacks items from the previous order, you must put the box for that SKU away (and similarly take one out that you don't have).
How do you sort this so there are a minimum number of box changes? Or, in more programmy terms: how do you minimize the cumulative Hamming distance / maximize the intersection between adjacent items in a collection?
I really have no clue where to start. Is there already some algorithm for this? Is there a decent approximation?
Upvotes: 10
Views: 430
Reputation: 47020
I like this problem so much I couldn't resist coding up the algorithm suggested above. The code is a little long, so I'm putting it in a separate response.
It comes up with this sequence on the example.
Step 1: c d
Step 2: b c
Step 3: a b c
Step 4: a b c d
Step 5: a b c d
Step 6: a b
Note this algorithm ignores initial setup and final teardown costs. It only considers inter-setup distances. Here the Hamming distances are 2 + 1 + 1 + 0 + 2 = 6. This is the same total distance as the order given in the question.
#include <stdio.h>
#include <stdlib.h>
// With these data types we can have up to 64k items and 64k sets of items,
// But then the table of pairs is about 20Gb!
typedef unsigned short ITEM, INDEX;
// A sku set in the problem.
struct set {
INDEX n_elts; // Number of elements in elts.
ITEM *elts; // Element ids, sorted ascending (hamming_distance relies on this).
};
// A pair of sku sets and associated info.
// NOTE: the rank/parent fields double as the union-find overlay in
// make_plan(): after sorting, entries 0..n_sets-1 of the pairs array
// serve as the per-set disjoint-set nodes (requires n_pairs >= n_sets).
struct pair {
INDEX i, j; // Indices of sets.
ITEM dist; // Hamming distance between sets.
INDEX rank, parent; // Disjoint set union/find fields.
};
// For a given set, the adjacent ones along the path under construction.
struct adjacent {
unsigned char n; // Degree along the path so far: 0, 1, or 2.
INDEX elts[2]; // Indices of n adjacent sets.
};
// Some tracing functions for fun.
void print_pair(struct pair *pairs, int i)
{
struct pair *p = pairs + i;
printf("%d:(%d,%d@%d)[%d->%d]\n", i, p->i, p->j, p->dist, p->rank, p->parent);
}
// Trace helper: show set i's neighbors along the path under construction.
// Fix: the degree-0 case previously printed no trailing newline, unlike
// the other cases, which garbled interleaved trace output.
void print_adjacent(struct adjacent *adjs, int i)
{
    struct adjacent *a = adjs + i;
    switch (a->n) {
    case 0:
        printf("%d:o\n", i); // isolated node: no neighbors yet
        break;
    case 1:
        printf("%d:o->%d\n", i, a->elts[0]); // path endpoint
        break;
    default:
        printf("%d:%d<-o->%d\n", i, a->elts[0], a->elts[1]); // interior node
        break;
    }
}
// Compute the Hamming distance between two sets. Assumes elements are sorted.
// Works a bit like merging.
// Compute the Hamming distance between two sets, i.e. the number of
// elements present in exactly one of them.  Both element arrays must be
// sorted ascending; we advance through them in lockstep, merge-style.
ITEM hamming_distance(struct set *a, struct set *b)
{
    ITEM diff = 0;
    int pa = 0, pb = 0;
    while (pa < a->n_elts && pb < b->n_elts) {
        ITEM ea = a->elts[pa], eb = b->elts[pb];
        if (ea == eb) {
            // Shared element: contributes nothing to the distance.
            pa++;
            pb++;
        } else {
            // The smaller element is unique to its set.
            diff++;
            if (ea < eb)
                pa++;
            else
                pb++;
        }
    }
    // Whatever remains in either set is unmatched.
    return diff + (a->n_elts - pa) + (b->n_elts - pb);
}
// Classic disjoint set find operation.
// Classic disjoint set find with full path compression, done iteratively:
// first locate the root, then repoint every node on the path at it.
// End state is identical to the recursive formulation.
INDEX find(struct pair *pairs, INDEX x)
{
    INDEX root = x;
    while (pairs[root].parent != root)
        root = pairs[root].parent;
    while (pairs[x].parent != root) {
        INDEX next = pairs[x].parent;
        pairs[x].parent = root;
        x = next;
    }
    return root;
}
// Classic disjoint set union. Assumes x and y are canonical.
// Classic disjoint set union by rank. Assumes x and y are canonical roots.
void do_union(struct pair *pairs, INDEX x, INDEX y)
{
    if (x == y)
        return;
    // Normalize so x is the root with the greater (or equal) rank.
    if (pairs[x].rank < pairs[y].rank) {
        INDEX tmp = x;
        x = y;
        y = tmp;
    }
    // Attach the lower-ranked tree beneath the higher-ranked root;
    // ranks only grow when two equal-rank trees merge.
    pairs[y].parent = x;
    if (pairs[x].rank == pairs[y].rank)
        pairs[x].rank++;
}
// Sort predicate to sort pairs by Hamming distance.
int by_dist(const void *va, const void *vb)
{
const struct pair *a = va, *b = vb;
return a->dist < b->dist ? -1 : a->dist > b->dist ? +1 : 0;
}
// Return a plan with greedily found least Hamming distance sum.
// Just an array of indices into the given table of sets.
// TODO: Deal with calloc/malloc failure!
// Return a plan with greedily found least Hamming distance sum:
// an array of n_sets indices into the given table of sets, which the
// caller must free.  Returns NULL if n_sets is zero or an allocation
// fails (resolving the old TODO about calloc/malloc failure).
INDEX *make_plan(struct set *sets, INDEX n_sets)
{
    if (n_sets == 0)
        return NULL;
    // With fewer than three sets every ordering has the same cost, and
    // the union-find overlay below needs n_pairs >= n_sets (true only
    // for n_sets >= 3), so handle the degenerate sizes up front.
    if (n_sets <= 2) {
        INDEX *tiny = malloc(n_sets * sizeof(INDEX));
        if (!tiny)
            return NULL;
        for (INDEX i = 0; i < n_sets; i++)
            tiny[i] = i;
        return tiny;
    }
    // Allocate enough space for all the pairs taking care for overflow.
    // This grows as the square of n_sets!
    size_t n_pairs = (n_sets & 1) ? n_sets / 2 * n_sets : n_sets / 2 * (n_sets - 1);
    struct pair *pairs = calloc(n_pairs, sizeof(struct pair));
    if (!pairs)
        return NULL;
    // Initialize one pair per unordered set pair {i, j}, i < j.
    size_t ip = 0;
    for (int j = 1; j < n_sets; j++) {
        for (int i = 0; i < j; i++) {
            struct pair *p = pairs + ip++;
            p->i = i;
            p->j = j;
            p->dist = hamming_distance(sets + i, sets + j);
        }
    }
    // Sort by Hamming distance.
    qsort(pairs, n_pairs, sizeof pairs[0], by_dist);
    // Initialize the disjoint sets.  Only entries 0..n_sets-1 are ever
    // used as union-find nodes (they are indexed by set number, and all
    // parent links stay within that range), which is why n_pairs must
    // be at least n_sets.
    for (size_t k = 0; k < n_pairs; k++) {
        pairs[k].rank = 0;
        pairs[k].parent = (INDEX)k;
    }
    // Greedily add cheapest edges, skipping any that would create a
    // vertex of degree three or a cycle, until the chosen edges form a
    // Hamiltonian path over the sets.
    struct adjacent *adjs = calloc(n_sets, sizeof(struct adjacent));
    if (!adjs) {
        free(pairs);
        return NULL;
    }
    size_t n_edges = 0;
    for (size_t k = 0; k < n_pairs; k++) {
        struct pair *p = pairs + k;
        struct adjacent *ai = adjs + p->i, *aj = adjs + p->j;
        // Continue if we'd get a vertex with degree 3 by adding this edge.
        if (ai->n == 2 || aj->n == 2)
            continue;
        // Find (possibly) disjoint sets of pair's elements.
        INDEX i_set = find(pairs, p->i);
        INDEX j_set = find(pairs, p->j);
        // Continue if we'd form a cycle by adding this edge.
        if (i_set == j_set)
            continue;
        // Otherwise add this edge.
        do_union(pairs, i_set, j_set);
        ai->elts[ai->n++] = p->j;
        aj->elts[aj->n++] = p->i;
        // Done after we've added enough edges to touch all sets in a path.
        if (++n_edges == (size_t)(n_sets - 1))
            break;
    }
    // Find a set with only one adjacency: an endpoint of the path.
    // (Guaranteed to exist since n_sets >= 3 and the graph is complete.)
    int p = -1;
    for (int i = 0; i < n_sets; ++i) {
        if (adjs[i].n == 1) {
            p = i;
            break;
        }
    }
    // A plan is an ordering of sets.
    INDEX *plan = malloc(n_sets * sizeof(INDEX));
    if (!plan) {
        free(pairs);
        free(adjs);
        return NULL;
    }
    // Walk the path from the endpoint; at each node step to the neighbor
    // we did not just come from (tracked explicitly in prev, avoiding
    // the subtle plan[i-1] access of the original).
    int prev = -1;
    for (int i = 0; i < n_sets; i++) {
        plan[i] = (INDEX)p;
        struct adjacent *a = adjs + p;
        int next = a->elts[a->n > 1 && a->elts[1] != prev];
        prev = p;
        p = next;
    }
    // Done with intermediate data structures.
    free(pairs);
    free(adjs);
    return plan;
}
// A tiny test case. Much more testing needed!
#define ARRAY_SIZE(A) (sizeof A / sizeof A[0])
// Expands to a struct set initializer { element count, element array }.
#define SET(Elts) { ARRAY_SIZE(Elts), Elts }
// Items must be in ascending order for Hamming distance calculation.
ITEM a1[] = { 'a', 'b' };
ITEM a2[] = { 'a', 'b', 'c' };
ITEM a3[] = { 'a', 'b', 'c', 'd' };
ITEM a4[] = { 'a', 'b', 'c', 'd' };
ITEM a5[] = { 'b', 'c' };
ITEM a6[] = { 'c', 'd' };
// Out of order to see how we do.
struct set sets[] = { SET(a3), SET(a6), SET(a1), SET(a4), SET(a5), SET(a2) };
// Build a plan for the sample sets and print one packing step per line.
int main(void)
{
    int n_sets = ARRAY_SIZE(sets);
    INDEX *plan = make_plan(sets, n_sets);
    for (int step = 0; step < n_sets; step++) {
        struct set *s = &sets[plan[step]];
        printf("Step %d: ", step + 1);
        for (int k = 0; k < s->n_elts; k++)
            printf("%c ", (char)s->elts[k]);
        printf("\n");
    }
    return 0;
}
Upvotes: 1
Reputation: 47020
Indeed @irrelephant is correct. This is an undirected Hamiltonian path problem. Model it as a complete undirected graph where the nodes are sku sets and the weight of each edge is the Hamming distance between the respective sets. Then finding a packing order is equivalent to finding a path that touches each node exactly once. This is a Hamiltonian path (HP). You want the minimum weight HP.
The bad news is that finding a min weight HP is NP-hard (its decision version is NP-complete), which means an optimal solution is believed to need exponential time in general.
The good news is that there are reasonable approximation algorithms. The obvious greedy algorithm gives an answer no worse than two times the optimal HP. It is:
create the graph of Hamming distances
sort the edges by weight in increasing order: e0, e1, ...
set C = emptyset
for e in sequence e0, e1, ...
if C union {e} does not cause a cycle nor a vertex with degree more than 2 in C
set C = C union {e}
return C
Note that the test in the `if` statement can be implemented in nearly constant time with the classical disjoint-set union-find algorithm plus incident-edge counters in the vertices.
So the run time here can be O(n^2 log n) for n sku sets assuming that computing a Hamming distance is constant time.
If graphs are not in your vocabulary, think of a triangular table with one entry for each pair of sku sets. The entries in the table are Hamming distances. You want to sort the table entries and then add sku set pairs in sorted order one by one to your plan, skipping pairs that would cause a "fork" or a "loop." A fork would be a set of pairs like (a,b), (b,c), (b,d). A loop would be (a,b), (b,c), (c, a).
There are more complex polynomial time algorithms that get to a 3/2 approximation.
Upvotes: 5