Reputation: 370
I need to apply some sorting algorithm on blockwise variable length data. Here are the constraints:
Data Length is not fixed.
Block Size is fixed.
One block contains single/multiple data.
Consider I need external sorting. RAM can't hold the whole dataset. Like data set size is 20 GB. Where I can use up to 2GB RAM.
Example: For simplicity consider each element is space separated word in block.
Consider block size is 26 (including space) and the first block contains 5 element where 2nd block contains only 3 elements.
As blocks are in a fixed size, sorted data may contain more blocks than assorted.
Blocks:
[Harry Ron Draco Dark Lord] [Hermione Longbottom Voldemort]
Sorted Blocks:
[Dark Draco Harry Hermione] [Lord Longbottom Ron] [Voldemort]
Which algorithm/technique will be efficient for this scenario?
Upvotes: 1
Views: 232
Reputation: 28808
An external merge sort will work. The statement "sorted data may contain more blocks than assorted" implies data records don't span blocks, so the number of blocks may vary due to unused space in blocks during the sort and merge process. The first pass reads a set of blocks into memory, sorts the data records, then writes a sorted run of blocks to a file, repeating this process until all of the original data is processed. The remaining passes merge the files, until a single sorted file is produced. A k-way merge can be used for this process, with 2-way being the simplest. For k > 2, a minheap would help in finding which of k runs has the "smallest" next element. To reduce I/O overhead, read and write multiple blocks at a time.
Upvotes: 1
Reputation: 133950
My first inclination would be the following:
Write a script that reads the input file, removes the blocks and writes a sequential file. That is, take your file that contains:
[Harry Ron Draco Dark Lord] [Hermione Longbottom Voldemort]
And write this file:
Harry
Ron
Draco
Dark
Lord
Hermione
Longbottom
Voldemort
Then, use your system's sort utility (for example GNU sort
) to sort the file, giving:
Dark
Draco
Harry
Hermione
Lord
Longbottom
Ron
Voldemort
Then, write a script that reads through that file and constructs the blocks, writing them to the final output.
This almost certainly is not the most efficient in terms of running time, but it's simple, reliable, easy to code, and easy to prove correct. You can probably code it up and test with a subset of the data in an hour or two. Then set it to work on the entire data set.
Upvotes: 1