Reputation: 2549
The original problem is like this:
You are to sort 1PB size of integers ranging from -2^31 ~ 2^31 - 1 (int), you have 1024 machines each having 1TB disk space and 16GB memory space. Assume disk speed is 128MB/s (r/w) and memory speed is 8GB/s (r/w). Time for CPU can be ignored. Network transfer time can be ignored for simplicity. Compute the approximated time needed.
I know with external sort we can sort the 1TB data on a single machine in roughly 10hrs as computed like this:
Disk access (2r2w): 1T * 4 / 128MB/s = 2 ^ 15 sec ~ 9 hrs
Mem access:
sorting 2^48 Integers in 64 parts (2 ^ 42 each) roughly takes 1.3 min each. So totally 1.4 hr.
63 way merging takes several seconds, and thus is ignored.
But what about the next step: the combination of 1024T data? I have no idea how this is computed. So any help please?
Upvotes: 1
Views: 505
Reputation: 7817
2^31 is = 2 billion (2 "giga"). So you are looking at lot of duplicate numbers and fixed range. So consider Radix Sort ( http://en.wikipedia.org/wiki/Radix_sort ).
Each processor, for a subset od data) creates 'count' array (x[0] contains the count of 0s etc). Then you can merge all results into one array. Later you can "construct" the sorted array.
Upvotes: 1