Reputation: 22094
Given 100 GB of integer data on a hard disk and 2 GB of RAM, how do you sort the integers with the minimum number of disk operations? Here fetching one number from disk is counted as one disk operation (though in reality a block of data can be fetched).
We can use additional space on the disk for temporary storage, and we don't need to count the operations needed to clean up the temporary space afterwards.
Upvotes: 8
Views: 19140
Reputation: 2530
As other people have noted, you can use an O(n) counting sort. However, there are some additional problems you need to consider. We'll assume you're storing 32-bit integers, so 100GB ~ 27e9 ints.
If all the integers are the same, then that value will occur ~27e9 times, which is more than a 32-bit integer can hold. Therefore your counters will have to be 64-bit integers.
With 2GB of RAM, you can only hold ~250e6 of those counters in RAM at any one time. If we can't make any assumptions about the distribution of the integers, we would either have to:
- cache part of the counter array in RAM and write counters out to the HD whenever we hit a value whose counter is not currently cached, or
- make several passes over the data, each pass counting only the sub-range of values whose counters fit in RAM and ignoring everything else.
I think the latter option is better. Since we need ~4e9 64-bit counters and can only store 2GB, we would need to run through the entire array ~16 times. The first option is clearly no good if we consider encountering a sequence of integers such as 0,1<<31,0. These counters would not be stored in RAM at the same time, and thus at least 2 HD writes are required.
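A minimal sketch of what that multi-pass counting could look like in Python, assuming the input is a flat file of native-endian uint32 values; the file names, slice size and read-buffer size below are illustrative, not from the question:

```
import array

INPUT = "input.bin"            # the 100 GB of integers (assumed layout)
OUTPUT = "sorted.bin"
SLICE = 2**28                  # 2**28 counters * 8 bytes = 2 GiB of RAM per pass
READ_CHUNK = 64 * 1024 * 1024  # bytes fetched from disk per read

def multi_pass_counting_sort():
    with open(OUTPUT, "wb") as out:
        lo = 0
        while lo < 2**32:                             # one pass per slice of the value range
            hi = min(lo + SLICE, 2**32)
            counts = array.array("q", [0]) * (hi - lo)   # zeroed 64-bit counters
            with open(INPUT, "rb") as f:              # full scan of the input per pass
                while True:
                    buf = f.read(READ_CHUNK)
                    if not buf:
                        break
                    for v in array.array("I", buf):
                        if lo <= v < hi:              # ignore values outside this slice
                            counts[v - lo] += 1
            for v, c in enumerate(counts, start=lo):  # emit this slice of the sorted output
                while c > 0:
                    n = min(c, 1_000_000)             # write long runs in bounded pieces
                    out.write(array.array("I", [v] * n).tobytes())
                    c -= n
            lo = hi
```

With 2**28 counters (2 GiB) per slice, this makes exactly 16 passes over the input, matching the estimate above.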
Because of this, I think for the specific size of your problem (100GB), an N-way merge sort would be better, as this would only require reading the entire array log_2(100) ~ 7 times.
However, if the interviewer immediately changed the question to "10TB array, still 2GB of RAM", then the counting sort would easily win.
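For reference, a quick back-of-the-envelope comparison of the two read-pass counts (assuming 4-byte values, 8-byte counters and 2 GiB of usable RAM):

```
GiB = 2**30
data_bytes, ram_bytes = 100 * GiB, 2 * GiB

# Range-restricted counting sort: one full read of the input per RAM-sized
# slice of the ~32 GiB counter array (4.3e9 counters * 8 bytes).
counter_bytes = 2**32 * 8
counting_read_passes = -(-counter_bytes // ram_bytes)   # ceiling division -> 16

# Merge sort with binary merge passes (the log_2 estimate above):
# one pass to build the 2 GiB sorted runs, then ceil(log2(#runs)) merge passes.
runs = -(-data_bytes // ram_bytes)                       # 50 initial runs
merge_read_passes = 1 + (runs - 1).bit_length()          # 1 + ceil(log2(50)) = 7

print(counting_read_passes, merge_read_passes)           # 16 7
```

Each merge pass also rewrites the full array, whereas the counting passes only write the final 100 GB of output, which is presumably what tips the balance toward counting as the data grows.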
Upvotes: 8
Reputation: 14870
I think that for a fast algorithm, another 100 GB of free HDD space is a precondition.
Just use any sort on 2 GB chunks and write them back to disk. Now you have 50 sorted chunks in the file, and you can use the merge sort suggested by Mihir on them. Write the output buffer to the output file as it fills. You'll just have to fine-tune the input and output buffer sizes.
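Here is a rough sketch of that scheme, assuming the input is a flat file of native-endian uint32 values; the chunk size, buffer size and file/function names are placeholders, not part of the answer:

```
import array
import heapq
import tempfile

CHUNK_BYTES = 2 * 1024**3      # ~2 GB of raw ints sorted per run
BUF_BYTES = 64 * 1024 * 1024   # I/O buffer size to fine-tune

def make_sorted_runs(input_path):
    """Phase 1: sort ~2 GB chunks in memory, flush each as a sorted run file."""
    runs = []
    with open(input_path, "rb") as f:
        while True:
            buf = f.read(CHUNK_BYTES)
            if not buf:
                break
            values = array.array("I", sorted(array.array("I", buf)))
            run = tempfile.NamedTemporaryFile(delete=False)
            run.write(values.tobytes())
            run.close()
            runs.append(run.name)
    return runs

def iter_run(path):
    """Stream one sorted run back from disk in large blocks."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(BUF_BYTES)
            if not buf:
                break
            yield from array.array("I", buf)

def merge_runs(run_paths, output_path):
    """Phase 2: 50-way merge of the runs, flushing the output buffer as it fills."""
    out_buf = array.array("I")
    with open(output_path, "wb") as out:
        for v in heapq.merge(*(iter_run(p) for p in run_paths)):
            out_buf.append(v)
            if len(out_buf) * 4 >= BUF_BYTES:
                out.write(out_buf.tobytes())
                del out_buf[:]
        out.write(out_buf.tobytes())

# merge_runs(make_sorted_runs("input.bin"), "sorted.bin")
```

heapq.merge consumes the run iterators lazily, so only one input buffer per run plus the output buffer needs to be resident at any time.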
There are some solutions based on counting, but counting cannot be used with such a large value range and such a large maximum possible count. You could only store the QWORD counters on disk, but that means many random accesses, which will certainly be slower than working with larger buffers.
Upvotes: 3
Reputation: 5445
Since the data being sorted is of integer type (4 bytes) and the total amount is 100 GB (where GB is 2^30 bytes), you'd have 26,843,545,600 integers to sort. Since there are 4,294,967,296 possible integer values, you can represent the data as an array of longs that serve as counters, which would consume about 34 GB of disk space. You'd read through the 100 GB of data once, incrementing the individual counter for each integer value encountered (300 GB of total disk access to read each value, read its counter, and write the incremented counter back), then read the counters in order, writing out each value as many times as it was counted (134 GB of total disk access).
This would sort the data using a total of 434 GB of disk access. If you use RAM to store part of the range of integer-value counters, you could technically lower the amount of disk access even more.
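A sketch of how those two passes might look, assuming a counter file pre-allocated with 2^32 zeroed 8-byte slots (the ~34 GB counter array) and a flat uint32 input file; all names and buffer sizes are illustrative:

```
import array
import struct

READ_CHUNK = 64 * 1024 * 1024

def count_pass(input_path="input.bin", counter_path="counters.bin"):
    """Pass 1: read the 100 GB once, bumping the on-disk 64-bit counter per value."""
    with open(input_path, "rb") as data, open(counter_path, "r+b") as counters:
        while True:
            buf = data.read(READ_CHUNK)
            if not buf:
                break
            for v in array.array("I", buf):
                counters.seek(8 * v)
                (c,) = struct.unpack("<q", counters.read(8))
                counters.seek(8 * v)
                counters.write(struct.pack("<q", c + 1))

def output_pass(counter_path="counters.bin", output_path="sorted.bin"):
    """Pass 2: read the counters in value order, emitting each value count times."""
    with open(counter_path, "rb") as counters, open(output_path, "wb") as out:
        for v in range(2**32):
            (c,) = struct.unpack("<q", counters.read(8))
            while c > 0:
                n = min(c, 1_000_000)   # write long runs in bounded pieces
                out.write(array.array("I", [v] * n).tobytes())
                c -= n
```

In practice you'd batch the counter updates, but the per-value read-counter/write-counter loop above mirrors the disk-access accounting in the answer.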
Upvotes: 5
Reputation: 1256
To me, the answer to this question depends crucially on the expected distribution of the numbers in the file.
There are roughly 27 billion 4-byte ints in 100 GB of int data, but only ~4.3 billion distinct int values.
Given a perfectly uniform distribution across all possible ints, you'd expect each int to show up roughly 6 times, give or take. This low level of duplication does not warrant changing from a standard sort routine (one that sorts chunks at a time and then merges the chunks together).
However, if we restrict the "file ints" to all be non-negative, then we immediately expect each valid int to appear roughly 12 times. That is approaching a level of duplication that may warrant a change of sorting routine. So I think you should ask the interviewer whether anything more can be assumed about the distribution of the ints on disk. After all, it would be odd to have 100 GB worth of data and have no idea whether it exhibits any predictable pattern.
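For reference, the arithmetic behind those estimates, assuming 4-byte ints:

```
ints_in_file = 100 * 2**30 // 4        # ~27 billion 4-byte ints in 100 GB
distinct_values = 2**32                # ~4.3 billion possible 32-bit values
print(ints_in_file / distinct_values)  # ~6.25 expected occurrences per value
print(ints_in_file / 2**31)            # ~12.5 if only non-negative values occur
```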
Upvotes: 3
Reputation: 17131
100 GB of integer data means you'll have a large number of duplicates. I'd personally choose a (bucketsort/selection) / mergesort approach as my first instinct if I'm trying to minimize disk I/O.
First read a bit under 1 GB of data into memory and mergesort it in memory. Flush it to disk. Repeat for each chunk of memory. Then you can walk the chunks and grab all the 0s, and repeat for each integer. It's going to take a long time, but that's only 203 GB read and 200 GB written in the worst case (theoretical).
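One way the "grab all the 0s, then all the 1s" phase could be realised, assuming the flushed chunks are sorted files of native-endian uint32 values (paths and buffer size are placeholders): keep a cursor into each chunk and pull every copy of the current smallest value before moving on.

```
import array
import heapq
import itertools

def iter_chunk(path, buf_bytes=64 * 1024 * 1024):
    """Stream one sorted chunk back from disk in large blocks."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(buf_bytes)
            if not buf:
                break
            yield from array.array("I", buf)

def select_in_order(chunk_paths, output_path):
    """Emit all occurrences of the smallest value, then the next value, and so on."""
    with open(output_path, "wb") as out:
        merged = heapq.merge(*(iter_chunk(p) for p in chunk_paths))
        for value, group in itertools.groupby(merged):
            while True:                  # write each value's run in bounded pieces
                run = array.array("I", itertools.islice(group, 1_000_000))
                if not run:
                    break
                out.write(run.tobytes())
```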
Upvotes: 2
Reputation: 6539
Merge sort is a popular approach when it comes to limited memory.
Upvotes: 2