Reputation: 91
I want to implement an external Merge Sort in Scala. It is used to sort huge files which can not fit in the main memory in their entirety.
Details can be found here :- How external merge sort algorithm works?
Now, I need to read chunks of the file, sort it and write it to disk etc etc
What's the most idiomatic / functional way of reading/writing a huge file in parts ?
Upvotes: 0
Views: 595
Reputation: 17933
Chunked Sorting/Writing
Suppose you want to keep N numbers in memory at a time and further suppose that you are given a function which writes the N sorted numbers to a file:
val N : Int = ???
val writeToFile : Seq[Int] => Unit = ???
As indicated in your question, an Iterator can be used to only keep N numbers in RAM at a time to sort them and write them to the intermediate files:
val sourceFileName : String = ???
val sortAndWrite : Seq[Int] => Unit =
(_.sorted) andThen writeToFile
Source
.fromFile(sourceFileName)
.getLines
.map(_.toInt)
.grouped(N)
.foreach(sortAndWrite)
Now you have each group of N numbers in a different file. The only thing left to do is merge the files together.
Merging
Given some reading function that returns the Iterators from each of the sub-files:
val subFiles : Iterable[Iterator[String]] = ???
We can write a function that will return a new Iterator that gets the values from each file and sorts them:
val mergeSort : Iterable[Iterator[String]] => Iterator[Int] =
(fileIterators) => {
val nonEmptyFiles = fileIterators filter (_.hasNext)
nonEmptyFiles
.map(_.next)
.map(_.toInt)
.sorted
.toIterator ++ mergeSort(nonEmptyFiles)
}
Note: the above function will keep a single Integer
in memory for each file, therefore the RAM usage is dependent on how many different files that writeToFile
created.
Now just write the values to a file:
val destinationFileName : String = ???
val writer : Writer = new FileWriter(destinationFileName)
mergeSort(subFiles) foreach (i => writer write i.toString)
Imperfect Sorting
One thing to note: if N is small and the source file is not sufficiently random then the solution will not produce a perfect sort. Example: suppose N = 2
and the initial list was [10,11,0,1]
then the algorithm, after one pass, would produce [0,10,1,11]
as the result.
Upvotes: 2