Reputation: 17870
When we externally merge sort a large file, we split it into small ones, sort those, and then merge them back into a large sorted file.
When merging, we can either do many 2-way merge passes, or one multi-way merge.
I am wondering which approach is better? and why?
Upvotes: 10
Views: 3358
Reputation: 11061
One multi-way merge is generally better. Consider three small files:
a1
a2
a3
and
b1
b2
b3
and finally
c1
c2
c3
If you do a merge with a
and b
, we're left with (say)
a1
b1
a2
b2
b3
a3
and
c1
c2
c3
A final merge would create the sorted list, but notice how in this final merge we have to visit the a
and b
items again. It's this re-merging that is wasteful in cascading two-way merges.
What you can do instead is a single multi-way merge. However, be careful how you do it. Specifically, avoid the naive double-loop that scans each cursor to see which has the minimum value. Use a min-heap instead. This will bring the complexity back down to O(n log n)
.
Upvotes: 7