Reputation: 61
Suppose you're given a merge function that will merge (find the union of) two lists L1 and L2 of size s1 and s2 in O(s1+s2) time. What is the optimal way to merge k lists of size s1, s2, ..., sk?
My idea is to sort s1, ..., sk initially and then merge the two lists corresponding to the two smallest sizes. Once they're merged, we insert the size of the resulting list back into its position in the sorted size list and continue the process until we end up with one list.
I'm having trouble with two things: 1. Is this indeed optimal, or is there another method that returns in faster time? 2. How do we analyze the running time when the list sizes change as we merge?
Upvotes: 1
Views: 1409
Reputation: 241671
This is precisely the same problem as finding the optimal variable-length bit-encoding for a string formed from an alphabet of k symbols with known frequencies s1, s2, …, sk. Your algorithm is precisely the Huffman algorithm, and you'll find a proof of optimality in any textbook on algorithms (and many online resources) because it is the classic case of a greedy algorithm with a simple correctness proof.
The repeated application of a two-way merge induces a binary tree in which each internal node is a merge. Given that tree, the contribution of any leaf to the total cost of the overall merge is the weight of that leaf multiplied by its depth in the tree. (Each internal node is a merge, and the values in a leaf participate in exactly the merges on the path from that leaf to the root; the number of such merges is the depth of the leaf in the tree.) Identically, the total length of a Huffman-encoded bitstring is the sum over symbols of the weight (frequency) of each symbol multiplied by the depth of the leaf corresponding to that symbol in the construction tree.
One small improvement to your algorithm (which is often missed by people writing Huffman tree builders): it is necessary to sort the weights s1, s2, …, sk, but that is the only sort needed. From there, the algorithm always selects the two lowest-weight nodes and adds them. The resulting sums must be monotonically non-decreasing (if a sum were smaller than a previous sum, the previous sum could not have been the sum of the two smallest elements). So you can put the sums in a FIFO queue; at each step you take the two smallest elements from the fronts of either the sorted array of leaves or the (implicitly) sorted queue of sums.
This can be optimized further by overwriting the array of leaves with the queue of nodes. (The queue then grows from the bottom of the array towards the top; it's fairly simple to prove that the top of the queue will never overtake the bottom of the array.)
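The two-queue idea might be sketched like this in Python (names are my own; this version keeps the leaves and sums in separate `deque`s rather than overwriting the leaf array in place, which is simpler to read but uses the same O(n) post-sort logic):

```python
from collections import deque

def merge_cost_two_queues(sizes):
    """Greedy merge cost in O(n) time after the initial sort:
    the sums are produced in non-decreasing order, so a FIFO
    queue of sums replaces the heap."""
    leaves = deque(sorted(sizes))
    sums = deque()  # stays sorted automatically
    total = 0

    def pop_smallest():
        # Take from whichever queue has the smaller front element.
        if not sums or (leaves and leaves[0] <= sums[0]):
            return leaves.popleft()
        return sums.popleft()

    while len(leaves) + len(sums) > 1:
        s = pop_smallest() + pop_smallest()
        total += s
        sums.append(s)  # appended sums never decrease
    return total
```

The in-place variant described above would write the sums into the already-consumed low end of the leaf array, so no extra storage is needed beyond the array itself.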
Upvotes: 2