ordinary
ordinary

Reputation: 6133

How to sort K sorted arrays, with MERGE SORT

I know that this question has been asked, and there is a very nice elegant solution using a min heap.

MY question is how would one do this using the merge function of merge sort.

You already have an array of sorted arrays. So you should be able to merge all of them into one array in O(nlog K) time, correct?

I just can't figure out how to do this!

Say I have

[ [5,6], [3,4], [1,2], [0] ]

Step 1: [ [3,4,5,6], [0,1,2] ]

Step2: [ [0,1,2,3,4,5,6] ]

Is there a simple way to do this? Is O(nlog K) theoretically achievable with mergesort?

Upvotes: 6

Views: 23290

Answers (5)

arviman
arviman

Reputation: 5255

Just do it like a 2-way merge except with K items. Will result in O(NK). If you want O(N logK) you will need to use a min-heap to keep track of the K pointers(with source array as a metadata) in the algorithm below:

Keep an array of K elements - i.e K pointers showing position in each array. Mark all K elements are valid.

loop: Compare values in K pointers that are valid. if the value is minimum, select least index pointer and increment it into the next value in the array. If incremented value has crossed it's array, mark it invalid. Add the least value into the result. Repeat till all K elements are invalid.

For example,:

Positions        Arrays
p1:0  Array 1:  0  5  10  
p2:3  Array 2:  3  6   9
p3:2  Array 3:  2  4  6

Output (min of 0,3,2)=> 0. So output is {0}

      Array
p1:5    0  5  10
p2:3    3  6   9
p3:2    2  4  6

Output (min of 5,3,2)=> 2. So {0,2}

       Array
p1:5    0  5  10
p2:3    3  6  9
p3:4    2  4  6

Output (min of 5,3,4)=>3. So {0,2,3} ..and so on..until you come to a state where output is {0,2,3,4,5,6}

   Array
p1:5    0  5  10
p2:9    3  6  9
p3:6    2  4  6

Output (min of 5,9,6)=>6. So {0,2,3,4,5,6}+{6} when you mark p3 as "invalid" as you have exhausted the array. (or if you are using a min-heap you will simply remove the min-item, get it's source array metadata: in this case array 3, see that it's done so you will not add anything new to the min-heap)

Upvotes: 1

yinchuandong
yinchuandong

Reputation: 41

I implemented it in python. The main idea is similar to mergesort. There are k arrays in lists. In function mainMerageK, just divide lists (k) into left (k/2) and right (k/2). Therefore, the total count of partition is log(k). Regarding function merge, it is easy to know the runtime is O(n). Finally, we get O(nlog k) By the way, it also can be implemented in min heap, and there is a link: Merging K- Sorted Lists using Priority Queue

def mainMergeK(*lists):
    # implemented by k-way partition
    k = len(lists)
    if k > 1:
        mid = int(k / 2)
        B = mainMergeK(*lists[0: mid])
        C = mainMergeK(*lists[mid:])
        A = merge(B, C)
        print B, ' + ', C, ' = ', A
        return A
    return lists[0]

def merge(B, C):
    A = []
    p = len(B)
    q = len(C)
    i = 0
    j = 0
    while i < p and j < q:
        if B[i] <= C[j]:
            A.append(B[i])
            i += 1
        else:
            A.append(C[j])
            j += 1
    if i == p:
        for c in C[j:]:
            A.append(c)
    else:
        for b in B[i:]:
            A.append(b)
    return A

if __name__ == '__main__':
    x = mainMergeK([1, 3, 5], [2, 4, 6], [7, 8, 10], [9])
    print x

The output likes below:

[1, 3, 5]  +  [2, 4, 6]  =  [1, 2, 3, 4, 5, 6]
[7, 8, 10]  +  [9]  =  [7, 8, 9, 10]
[1, 2, 3, 4, 5, 6]  +  [7, 8, 9, 10]  =  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Upvotes: 2

Lirrik
Lirrik

Reputation: 815

You should note that when we say complexity is O(n log k), we assume that n means TOTAL number of elements in ALL of k arrays, i.e. number of elements in a final merged array.

For example, if you want to merge k arrays that contain n elements each, total number of elements in final array will be nk. So complexity will be O(nk log k).

Upvotes: 6

Jim Mischel
Jim Mischel

Reputation: 133975

As others have said, using the min heap to hold the next items is the optimal way. It's called an N-way merge. Its complexity is O(n log k).

You can use a 2-way merge algorithm to sort k arrays. Perhaps the easiest way is to modify the standard merge sort so that it uses non-constant partition sizes. For example, imagine that you have 4 arrays with lengths 10, 8, 12, and 33. Each array is sorted. If you concatenated the arrays into one, you would have these partitions (the numbers are indexes into the array, not values):

[0-9][10-17][18-29][30-62]

The first pass of your merge sort would have starting indexes of 0 and 10. You would merge that into a new array, just as you would with the standard merge sort. The next pass would start at positions 18 and 30 in the second array. When you're done with the second pass, your output array contains:

[0-17][18-62]

Now your partitions start at 0 and 18. You merge those two into a single array and you're done.

The only real difference is that rather than starting with a partition size of 2 and doubling, you have non-constant partition sizes. As you make each pass, the new partition size is the sum of the sizes of the two partitions you used in the previous pass. This really is just a slight modification of the standard merge sort.

It will take log(k) passes to do the sort, and at each pass you look at all n items. The algorithm is O(n log k), but with a much higher constant than the N-way merge.

For implementation, build an array of integers that contains the starting indexes of each of your sub arrays. So in the example above you would have:

int[] partitions = [0, 10, 18, 30];
int numPartitions = 4;

Now you do your standard merge sort. But you select your partitions from the partitions array. So your merge would start with:

merge (inputArray, outputArray, part1Index, part2Index, outputStart)
{
    part1Start = partitions[part1Index];
    part2Start = partitions[part2Index];

    part1Length = part2Start - part1Start;
    part2Length = partitions[part2Index-1] - part2Start;

    // now merge part1 and part2 into the output array,
    // starting at outputStart
}

And your main loop would look something like:

while (numPartitions > 1)
{
    for (int p = 0; p < numPartitions; p += 2)
    {
        outputStart = partitions[p];
        merge(inputArray, outputArray, p, p+1, outputStart);
        // update partitions table
        partitions[p/2] = partitions[p] + partitions[p+1];
    }
    numPartitions /= 2;
}

That's the basic idea. You'll have to do some work to handle the dangling partition when the number is odd, but in general that's how it's done.

You can also do it by maintaining an array of arrays, and merging each two arrays into a new array, adding that to an output array of arrays. Lather, rinse, repeat.

Upvotes: 13

Andrei Galatyn
Andrei Galatyn

Reputation: 3432

There different ways to merge arrays. To accoplish that task in N*Log(K) time you can use a structure called Heap (it is good structure to implement priority queue). I suppose that you already have it, if you don’t then pick up any available implementation: http://en.wikipedia.org/wiki/Heap_(data_structure) Then you can do that like this:

1.  We have A[1..K] array of arrays to sort, Head[1..K] - current pointer for every array and Count[1..K] - number of items for every array.
2.  We have Heap of pairs (Value: int; NumberOfArray: int) - empty at start.
3.  We put to the heap first item of every array - initialization phase.
4.  Then we organize cycle:
5.  Get pair (Value, NumberOfArray) from the heap. 
6.  Value is next value to output.
7.  NumberOfArray – is number of array where we need to take next item (if any) and place to the heap.
8.  If heap is not empty, then repeat from step 5

So for every item we operate only with heap built from K items as maximum. It mean that we will have N*Log(K) complexity as you asked.

Upvotes: 2

Related Questions