Reputation: 33685
I'm looking to sort lists of 1 billion to 100 billion elements on systems with 8-128 cores, RAM for 10% of the elements, and disks delivering 100-1000 MBytes/s.
I have tested a simple merge sort, where each merge is performed by its own CPU core so the merges run in parallel:
sorted_part_a:__
                \__[CPU.1]__
sorted_part_b:__/           \
                             \__[CPU.5]__
sorted_part_c:__             /           \
                \__[CPU.2]__/             \
sorted_part_d:__/                          \
                                            \__[CPU.7]
sorted_part_e:__                            /
                \__[CPU.3]__               /
sorted_part_f:__/           \             /
                             \__[CPU.6]__/
sorted_part_g:__             /
                \__[CPU.4]__/
sorted_part_h:__/
But that has the problem that the final merge step [CPU.7] has to do n comparisons on a single core when merging the last two inputs, and comparisons can be expensive (think strings that have to respect a locale setting). In my test [CPU.7] was the bottleneck.
I have then looked into red-black trees. They have several advantages: once the tree is built, the sorted list can be extracted in O(n) with no comparisons, which avoids the bottleneck I saw in my merge sort test. Saving the tree to disk also seems pretty easy (simply export the sorted list and the height of the tree), but getting only part of the tree back from disk seems more tricky.
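As a sketch of what I mean, using a plain balanced binary search tree as a stand-in for a red-black tree (illustrative code, not my actual test): rebuilding the tree from the exported sorted list and walking the sorted list back out are both O(n) and need zero key comparisons.

from dataclasses import dataclass
from typing import Any, Iterator, List, Optional


@dataclass
class Node:
    key: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_balanced(items: List[Any], lo: int = 0, hi: Optional[int] = None) -> Optional[Node]:
    """Rebuild a balanced search tree from an already-sorted list: O(n), no comparisons."""
    if hi is None:
        hi = len(items)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(items[mid],
                build_balanced(items, lo, mid),
                build_balanced(items, mid + 1, hi))


def in_order(node: Optional[Node]) -> Iterator[Any]:
    """Export the sorted list again: a plain O(n) walk, no comparisons."""
    if node is not None:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)


assert list(in_order(build_balanced(["a", "b", "c", "d", "e"]))) == ["a", "b", "c", "d", "e"]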
I have read Which parallel sorting algorithm has the best average case performance?, but it seems to ignore the common case of medium-sized data: the data fits on the server's disk, but it does not fit in RAM.
Given the hardware (8-128 cores, RAM for 10% of the elements, disks delivering 100-1000 MBytes/s streaming and 1000 IOPS), what is the fastest way to sort lists of 10^9 to 100 * 10^9 elements of 10-100 bytes each?
In layman's terms:
What is the tried and true way to quickly sort the largest amount of data you would sort on a single server?
Upvotes: 0
Views: 889
Reputation: 133950
In the traditional merge, using sorted sub files, that final merge is O(n log k), where n is the total number of items and k is the number of sub files. Basically, you build a priority queue of the first items from each of the sorted sub files, remove the first item, write it out, and then insert the next item from the file that had the smallest item.
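As a sketch, assuming the sorted sub files are plain text files with one record per line and that ordinary string comparison is what you want, the whole k-way merge is a few lines of Python; heapq.merge is the priority queue:

import contextlib
import heapq


def merge_runs(run_paths, out_path):
    """Merge k sorted text files (one record per line) into one sorted file."""
    with contextlib.ExitStack() as stack:
        runs = [stack.enter_context(open(p)) for p in run_paths]
        out = stack.enter_context(open(out_path, "w"))
        # heapq.merge keeps the smallest current line of each run in a heap,
        # so each of the n output lines costs O(log k) comparisons.
        out.writelines(heapq.merge(*runs))

If the comparisons have to respect a locale, heapq.merge also takes a key function (for example locale.strxfrm), which is exactly where the expensive comparisons the question worries about end up.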
But you can parallelize that merge. Say you have 8 sub files. You can build a merge network like this:
f1  f2  f3  f4  f5  f6  f7  f8
 \  /    \  /    \  /    \  /
  p1      p2      p3      p4
   \__  __/        \__  __/
      \ /             \ /
      p5              p6
       \_______ _______/
              \ /
               p7
The idea here is that each processor core p1 through p4 begins merging two files. Processors p5 and p6 each merge the output from two of the first-level processors, and p7 merges the results from those. p7 ends up doing n comparisons rather than the O(n log k) comparisons it would have done if a single CPU core were doing the whole merge.
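A rough sketch of that network, using one process per merge node and bounded multiprocessing queues as the wires (the toy runs, queue sizes and sentinel convention are illustrative assumptions, not part of the scheme itself):

import heapq
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of a stream


def stream(q):
    """Yield items from a queue until the SENTINEL arrives."""
    while True:
        item = q.get()
        if item is SENTINEL:
            return
        yield item


def feed(run, out_q):
    """Leaf of the network: push one already-sorted run into its wire."""
    for item in run:
        out_q.put(item)
    out_q.put(SENTINEL)


def merge_node(left_q, right_q, out_q):
    """One core in the network: merge two sorted streams into one."""
    for item in heapq.merge(stream(left_q), stream(right_q)):
        out_q.put(item)
    out_q.put(SENTINEL)


if __name__ == "__main__":
    # Eight small sorted runs stand in for the sorted sub files f1..f8.
    runs = [sorted(range(i, 80, 8)) for i in range(8)]

    queues = [Queue(maxsize=1024) for _ in runs]
    procs = [Process(target=feed, args=(run, q)) for run, q in zip(runs, queues)]

    # Pair the streams up level by level: 8 -> 4 (p1..p4) -> 2 (p5, p6) -> 1 (p7).
    while len(queues) > 1:
        next_level = []
        for left, right in zip(queues[0::2], queues[1::2]):
            out = Queue(maxsize=1024)
            procs.append(Process(target=merge_node, args=(left, right, out)))
            next_level.append(out)
        queues = next_level

    for p in procs:
        p.start()
    result = list(stream(queues[0]))  # read the final merged stream from p7
    for p in procs:
        p.join()

    assert result == sorted(x for run in runs for x in run)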
Upvotes: 1
Reputation: 46389
I have never had to do this sort of thing when I didn't have custom built software to do the heavy lifting for me.
But the standard solution when I was at Google was to store your initial data in a distributed file system, do a distributed merge sort, and store the final data in a distributed file system. Since the final sorted data is stored in chunks, even in the final pass each CPU only has to do comparisons within its own chunk, allowing full CPU usage the whole way through.
For large data sets there is essentially never a use case where you want the whole thing in a single place at a single time so that you can iterate over it. On the contrary, imposing that arbitrary limitation just creates an unnecessary bottleneck.
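A rough single-machine analogue of that chunked layout, assuming string keys (the sample size, partition count and key format are illustrative): sample the keys to pick range splitters, route every record to its range, and sort each range on its own core. The concatenation of the sorted ranges is the sorted whole, so no core ever compares keys outside its own chunk.

import bisect
import random
from multiprocessing import Pool


def partition(records, n_parts, sample_size=10_000):
    """Route records into n_parts ranges chosen from a random sample of keys."""
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    splitters = [sample[i * len(sample) // n_parts] for i in range(1, n_parts)]
    parts = [[] for _ in range(n_parts)]
    for r in records:
        # Partition i holds exactly the records with i splitters <= r,
        # so every record in part i sorts before every record in part i+1.
        parts[bisect.bisect_right(splitters, r)].append(r)
    return parts


if __name__ == "__main__":
    data = [f"key-{random.randrange(10**6):06d}" for _ in range(100_000)]
    parts = partition(data, n_parts=8)
    with Pool(processes=8) as pool:
        sorted_parts = pool.map(sorted, parts)  # each chunk sorted on its own core
    result = [rec for part in sorted_parts for rec in part]
    assert result == sorted(data)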
Upvotes: 1